Engineering

How to build an AI Agent for SRE (Part 3)

Eric Abruzzese
Software Engineer, Aptible AI

Last Updated: Nov 6, 2024
Give your Agent access to your documentation and external tools

Welcome back to our comprehensive guide on building your own AI Agent for SRE teams. So far, we’ve covered:

Part 1:

  • How to set up an application in Chainlit

  • How to connect your application to an LLM (in this case, gpt-4o)

Part 2:

  • How to make your Agent faster and add real-time “typing” (similar to ChatGPT)

  • How to give your Agent a personality and specialization

Now in this third and final section, we’ll show you how to:

  1. Give your Agent the ability to search your docs

  2. Integrate your Agent with external tools

Let’s get started with a few considerations and pro tips based on our own experience with Aptible AI.

Considerations and pro tips for building an AI Agent

Up to this point, we’ve covered a generalized setup for your bot, but now it’s time to make it useful for your org. When we started building Aptible AI, we went through a lot of trial and error when it came to setting up our tools and integrations. We covered this in more detail in Part 1 (see the section “What’s your integration strategy”), but here are some things to keep in mind, particularly with regard to the specific tools and knowledge bases your bot has access to:

1. What information should my Agent have access to?

🤔 Considerations: Before you start feeding your LLM all of the information you can possibly find, think strategically about what exactly you want the Agent to be able to do and what its purpose is. Later on in the guide, we’ll show you how to give your bot the ability to search a collection of files effectively. Just keep in mind that the more information you give it, the trickier it can become to maintain high levels of accuracy.

🛠️ What we built: Aptible AI was built for the very specific purpose of helping our SRE team and on-call engineers debug and resolve production issues faster. So we’ve given it access to our runbooks and other documentation, and we’ve designed it to gather and surface as much relevant information about the incident as it can. It then returns the data to Slack in a dynamic dashboard for our engineers to investigate.

We’ve found that by simply surfacing the information quickly, the Agent provides instant and consistent value, versus trying to answer every single question (which increases the likelihood of AI hallucination).

💡 Pro tip:

Resist the temptation to just plug the LLM into your existing search infrastructure! It may seem simpler, but LLMs perform better if you allow them to search semantically (i.e., using vectors) instead of trying to come up with keywords they think might be relevant.
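
For illustration, here’s a minimal sketch of what semantic search over a handful of docs might look like using OpenAI embeddings and cosine similarity (the embedding model name and helper functions here are our own example, not part of this guide’s code):

import numpy as np
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts (the model name here is illustrative)."""
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([item.embedding for item in response.data])

async def semantic_search(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Return the documents most semantically similar to the query."""
    doc_vectors = await embed(documents)
    query_vector = (await embed([query]))[0]
    # Cosine similarity: dot product of L2-normalized vectors.
    doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query_vector /= np.linalg.norm(query_vector)
    scores = doc_vectors @ query_vector
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]

In practice, the file_search tool you’ll wire up below handles this for you; the sketch is just to show why vector search beats keyword guessing.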

2. How should I set up my tools?

🤔 Considerations: What do you need the Agent to be able to do with integrations? How big a payload do you anticipate getting back from these tools? In many cases, a “good enough” first pass from the Agent can just involve passing a raw API response payload back to the LLM (just be careful that you don’t blow out the context window if the response is really big). Another thing to consider is the granularity of your tool calls.

Our recommendation is to go as granular as you can. If you can teach your Agent to use the relevant parts of an API individually and instruct it on when it should use each one, then it can build its own chain of operations. It might surprise you how creative it can be when solving problems!
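
To make that concrete, here’s a hedged sketch of what “granular” might look like: several small, single-purpose tools instead of one catch-all (these function names are hypothetical, not from any real API or from this guide’s code):

# Instead of one monolithic "investigate_incident" tool, expose small pieces
# the LLM can chain together on its own.

async def get_alert_details(alert_id: str) -> str:
    """Fetch one alert's details (hypothetical wrapper around your alerting API)."""
    ...

async def get_recent_deploys(service: str, limit: int = 5) -> str:
    """List recent deploys for a service (hypothetical wrapper around your CI/CD API)."""
    ...

async def get_error_logs(service: str, minutes: int = 30) -> str:
    """Fetch recent error logs for a service (hypothetical wrapper around your logging API)."""
    ...

With tools shaped like this, the Agent can decide to look up the alert, check recent deploys for the affected service, and pull logs from around the deploy time, in whatever order the incident calls for.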

🛠️ What we built: Because tools are at the heart of what makes the AI Agent useful for our team, we’ve spent a lot of time developing how our integrations work. We’ve designed the bot to gather more information from the user(s) during an incident in order to make the most relevant tool calls to surface the most helpful information. Sometimes that means that the bot asks follow-up questions, but generally it means that the bot continues to make various tool calls until it has deemed the response helpful (it achieves this by using a self-rating system that improves over time).

💡 Pro tip:

The LLM is smarter than you might think; avoid over-engineering your tools by trying to parse API responses or build out nice markdown documents. In most cases, that’s not necessary. Also (bonus pro tip): beware that too many tools can make it hard for the LLM to choose what to do next, so it's a delicate balance: only implement the tools that are useful, and be thoughtful about instructing the LLM on usage.
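
One lightweight guard that fits this philosophy, given the context-window caution above: clip oversized raw responses before handing them back to the LLM, rather than parsing them (the character budget below is an arbitrary illustration, not an OpenAI limit):

MAX_TOOL_OUTPUT_CHARS = 20_000  # arbitrary budget; tune for your model's context window

def clip_tool_output(payload: str, limit: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    """Truncate oversized tool output rather than parsing or reformatting it."""
    if len(payload) <= limit:
        return payload
    return payload[:limit] + "\n... [output truncated]"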

Now that we’ve covered the considerations, let’s dive into the final two steps for building your AI Agent for SRE teams! Keep in mind that if you haven’t completed the steps from Parts 1 and 2 of this guide, you’ll need to do so before you continue.

Hands-on lab, part 3: make your AI Agent more useful with your docs and external tools

5. Knowledge

Now that we’re using Assistants and Threads, we can start customizing behavior. To start, we’ll give our Assistant access to some of our internal documentation so that it can provide responses that are more tailored to our use case.

The Goal

By the end of this section, you’ll have given your bot the ability to search a collection of files (e.g., your SRE runbooks and other internal documentation) when responding to prompts.

For simplicity, we’ll implement this as a folder full of files that get uploaded into a vector store and provided to our Assistant.

Creating a Vector Store

The first thing we need to do is create a vector store and provide it to our Assistant.

First, update the beginning of our handle_chat_start function to include the following:

# ...
@cl.on_chat_start
async def handle_chat_start() -> None:
    vector_store = None
    # Try to find an existing vector store so we don't create duplicates.
    async for existing_vector_store in await client.beta.vector_stores.list():
        if existing_vector_store.name == OPENAI_ASSISTANT_NAME:
            vector_store = existing_vector_store
            break
    # Create a vector store if we didn't find an existing one.
    vector_store = vector_store or await client.beta.vector_stores.create(
        name=OPENAI_ASSISTANT_NAME,
    )
    
    assistant = None
    # ...
# ...

Next, update the call to client.beta.assistants.update() to give the assistant access to the vector store and enable the file_search tool.

assistant = await client.beta.assistants.update(
    assistant_id=assistant.id,
    instructions=OPENAI_ASSISTANT_INSTRUCTIONS,
    tools=[{"type": "file_search"}],
    tool_resources={
        "file_search": {
            "vector_store_ids": [vector_store.id],
        }
    },
)
Upload Documentation

Finally, we’ll need to upload our documentation that we want our assistant to reference when answering prompts.

First, we’ll need to create a folder where we’ll put our documents:

mkdir docs

Next, we’ll collect our documentation and put it in that folder. For testing purposes, I’ve added the following fake document to my folder:

---
title: SRE Runbook
url: https://internal.aptible.com/docs/runbooks/sre

Finally, we’ll update our handle_chat_start function to automatically upload our documents to the vector store we created earlier. Add the following code just after where we create the vector store:

if documents := list(Path("./docs").glob("**/*.md")):
    await client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id,
        files=(f.open("rb") for f in documents),
    )

ℹ️ Note:

For now, we’ll only support .md files, but OpenAI supports lots of different file types, so feel free to update the glob pattern to whatever makes sense for your use case!

This will automatically upload all of the files in the ./docs folder and add them to our vector store.
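
If you do broaden beyond Markdown, a tweak along these lines might work (the extra extensions are purely illustrative; check OpenAI’s list of supported file types before relying on them):

# Collect several file types instead of just Markdown.
patterns = ("**/*.md", "**/*.txt", "**/*.pdf")
documents = [path for pattern in patterns for path in Path("./docs").glob(pattern)]

if documents:
    await client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id,
        files=[path.open("rb") for path in documents],
    )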

Add an indicator

File search can sometimes take a while, especially for larger datasets. In those cases, you’ll probably want to let the user know what’s going on so they don’t get frustrated.

Luckily, Chainlit makes this easy by providing a Step class that we can use to tell the user that something’s happening in the background. We can use the Step class in conjunction with the MessageEventHandler we built earlier, and add an indicator any time a tool is called.

Add the following to your MessageEventHandler:

class MessageEventHandler(AsyncAssistantEventHandler):
    # ...

    @override
    async def on_tool_call_created(self, tool_call: ToolCall) -> None:
        """Create a new step in the conversation to indicate that a tool is being used."""
        async with cl.Step(tool_call.type) as step:
            self.step = step
Try it out

Now that you’ve uploaded some of your own documentation, try asking some questions that are more specific to your use case, and see what you get!

For our test case, it correctly referenced our runbook when asked about high CPU utilization on a customer database.

🧑‍💻 For reference, here's the complete code so far:

import os
from pathlib import Path
from typing import override
import chainlit as cl

from openai import AsyncOpenAI, AsyncAssistantEventHandler
from openai.types.beta.threads import Message, TextDelta, Text
from openai.types.beta.threads.runs.tool_call import ToolCall

##
# Settings
#
try:
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
except KeyError as ex:
    raise LookupError(f"Missing required environment variable: {ex}")

# Give your assistant a name.
OPENAI_ASSISTANT_NAME = "roger"

# Give your assistant some custom instructions.
OPENAI_ASSISTANT_INSTRUCTIONS = """
You are an expert Site Reliability Engineer, tasked with helping
the SRE team respond to and resolve incidents.

If you are presented with a question that does not seem like it
could be related to infrastructure, begin your response with a polite
reminder that your primary responsibilities are to help with incident
response, before fully answering the question to the best of your ability.
"""

client = AsyncOpenAI(api_key=OPENAI_API_KEY)

class MessageEventHandler(AsyncAssistantEventHandler):
    """An event handler for updating a Chainlit message while streaming an OpenAI response."""

    message: cl.Message

    @override
    async def on_text_created(self, text: Text) -> None:
        """Create a new message so that we can update it."""
        self.message = cl.Message(content="")
        await self.message.send()

    @override
    async def on_text_delta(self, delta: TextDelta, snapshot: Text) -> None:
        """Update the message with the latest text delta streamed to us."""
        await self.message.stream_token(delta.value)

    @override
    async def on_message_done(self, message: Message) -> None:
        """Update the message with the final text when the stream completes."""
        await self.message.update()

    @override
    async def on_tool_call_created(self, tool_call: ToolCall) -> None:
        """Create a new step in the conversation to indicate that a tool is being used."""
        async with cl.Step(tool_call.type) as step:
            self.step = step

@cl.on_chat_start
async def handle_chat_start() -> None:
    vector_store = None

    # Try to find an existing vector store so we don't create duplicates.
    async for existing_vector_store in await client.beta.vector_stores.list():
        if existing_vector_store.name == OPENAI_ASSISTANT_NAME:
            vector_store = existing_vector_store
            break

    # Create a vector store if we didn't find an existing one.
    vector_store = vector_store or await client.beta.vector_stores.create(
        name=OPENAI_ASSISTANT_NAME,
    )

    if documents := list(Path("./docs").glob("**/*.md")):
        await client.beta.vector_stores.file_batches.upload_and_poll(
            vector_store_id=vector_store.id,
            files=(f.open("rb") for f in documents),
        )

    assistant = None

    # Try to find an existing assistant so we don't create duplicates.
    async for existing_assistant in await client.beta.assistants.list():
        if existing_assistant.name == OPENAI_ASSISTANT_NAME:
            assistant = existing_assistant
            break

    # Create an assistant if we didn't find an existing one.
    assistant = assistant or await client.beta.assistants.create(
        name=OPENAI_ASSISTANT_NAME,
        model="gpt-4o",
    )

    # Update the assistant so that it always has the latest instructions
    assistant = await client.beta.assistants.update(
        assistant_id=assistant.id,
        instructions=OPENAI_ASSISTANT_INSTRUCTIONS,
        tools=[{"type": "file_search"}],
        tool_resources={
            "file_search": {
                "vector_store_ids": [vector_store.id],
            }
        },
    )

    # Create a thread for the conversation
    thread = await client.beta.threads.create()

    # Add the assistant and the new thread to the user session so that
    # we can reference it in other handlers.
    cl.user_session.set("assistant", assistant)
    cl.user_session.set("thread", thread)

@cl.on_message
async def handle_message(message: cl.Message) -> None:
    # Retrieve our Assistant and Thread from our user session.
    assistant = cl.user_session.get("assistant")
    thread = cl.user_session.get("thread")

    # Add the latest message to the thread.
    await client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=message.content,
    )

    # Stream a response to the Thread (called a "Run") using our Assistant.
    async with client.beta.threads.runs.stream(
        assistant_id=assistant.id,
        thread_id=thread.id,
        # Use our custom message handler.
        event_handler=MessageEventHandler(),
    ) as stream:
        await stream.until_done()


6. Tools

Our Agent is now able to retrieve data from our curated internal documentation, which is useful if you have good documentation. However, a lot of time in incident management is spent investigating things that aren’t covered by the docs: scanning alerts, reading logs, interpreting metrics, etc.

For those things, we want to give our assistant the ability to call external APIs — and more broadly, execute functions that we define — so that it can gather more context on an as-needed basis.

To do this, we’ll leverage the “function calling” capabilities of the model to execute functions that we define.

The Goal

By the end of this section, you’ll have given your bot the ability to use an external tool (a fake PagerDuty tool) to retrieve information when answering prompts.

Define the (fake) Tool

First, let’s add a new function to our app.py called get_pagerduty_alert_details.

import json
from datetime import datetime, timedelta

# ...

async def get_pagerduty_alert_details(pagerduty_alert_url: str) -> str:
    """Return the details of the PagerDuty alert at the given URL."""
    # TODO: Make this implementation real!
    return json.dumps({
        "alert": {
            "id": "PT4KHLK",
            "type": "alert",
            "summary": "A customer database is experiencing high CPU usage.",
            "self": pagerduty_alert_url,
            "html_url": pagerduty_alert_url,
            "created_at": (datetime.now() - timedelta(minutes=5)).isoformat(),
            "status": "resolved",
            "alert_key": "baf7cf21b1da41b4b0221008339ff357",
            "suppressed": False,
            "severity": "critical",
        }
    })
Tell the LLM how to use it

Next, we need to tell the LLM how to call our tool. OpenAI expects tool definitions in JSON Schema format.

Update your call to client.beta.assistants.update() to include a new tool definition after the file_search tool that we already have.

assistant = await client.beta.assistants.update(
    assistant_id=assistant.id,
    instructions=OPENAI_ASSISTANT_INSTRUCTIONS,
    tools=[
        # Our existing file search tool
        {"type": "file_search"},
        # Our new pagerduty alert tool
        {
            "type": "function",
            "function": {
                "name": get_pagerduty_alert_details.__name__,
                "description": get_pagerduty_alert_details.__doc__,
                "parameters": {
                    "type": "object",
                    "properties": {
                        "pagerduty_alert_url": {
                            "type": "string",
                            "format": "uri",
                            "description": "The PagerDuty alert URL. For example, '<https://example.pagerduty.com/alerts/Q3YDP8VKEZ9THL>'.",
                        },
                    },
                    "required": ["pagerduty_alert_url"],
                    "additionalProperties": False,
                },
            }
        }
    ],
    tool_resources={
        "file_search": {
            "vector_store_ids": [vector_store.id],
        }
    },
)
Update the message handler to handle tool calls

Our MessageEventHandler currently handles back-and-forth message events, but calling tools requires some special handling.

When responding to your prompt, the LLM will decide which tools it should call (if any), return one or more “tool call” definitions in the response payload, and indicate that the run “requires action”. In order to actually execute the function, we need to handle these “requires action” responses.

We can do this by updating our MessageEventHandler class to implement the on_event method, along with a new handle_requires_action method for executing our function call and adding the result to the running thread:

from openai.types.beta import AssistantStreamEvent
from openai.types.beta.threads import Run

# ...

class MessageEventHandler(AsyncAssistantEventHandler):
    # ...

    @override
    async def on_event(self, event: AssistantStreamEvent) -> None:
        """Handle specific events."""
        if event.event == "thread.run.requires_action":
            run_id = event.data.id  # Retrieve the run ID from the event data
            self.current_run.id = run_id
            await self.handle_requires_action(event.data, run_id)
    
    async def handle_requires_action(self, run: Run, run_id: str) -> None:
        """Handle events that require an action to be taken, like tool calls."""
        tool_outputs = []

        # Execute each tool call and collect the result.
        for tool in run.required_action.submit_tool_outputs.tool_calls:
            func_name = tool.function.name
            func_args = tool.function.arguments

            # TODO: Build a function map with a decorator instead of looking up
            # functions in globals.
            if func_to_call := globals().get(func_name):
                try:
                    # Parse the func_args JSON string to a dictionary
                    tool_outputs.append(
                        {
                            "tool_call_id": tool.id,
                            "output": await func_to_call(**json.loads(func_args)),
                        }
                    )
                except TypeError as ex:
                    print(f"Error calling function {func_name!r}: {str(ex)}")
            else:
                print(f"Function {func_name!r} not found")

        # Submit tool outputs to the conversation thread.
        async with client.beta.threads.runs.submit_tool_outputs_stream(
            thread_id=self.current_run.thread_id,
            run_id=run_id,
            tool_outputs=tool_outputs,
            event_handler=MessageEventHandler(),
        ) as stream:
            await stream.until_done()
  
    # ...
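
As the TODO in the snippet above hints, looking functions up in globals() works but is fragile. A minimal sketch of the decorator-based registry it suggests might look like this (the register_tool helper and TOOL_REGISTRY dict are our own names, not part of the OpenAI SDK):

from typing import Awaitable, Callable

# Map of tool names to the coroutine functions that implement them.
TOOL_REGISTRY: dict[str, Callable[..., Awaitable[str]]] = {}

def register_tool(func: Callable[..., Awaitable[str]]) -> Callable[..., Awaitable[str]]:
    """Register a coroutine function as a callable tool, keyed by its name."""
    TOOL_REGISTRY[func.__name__] = func
    return func

# The existing tool from this guide would then just be decorated:
@register_tool
async def get_pagerduty_alert_details(pagerduty_alert_url: str) -> str:
    """Return the details of the PagerDuty alert at the given URL."""
    ...

# And in handle_requires_action, the globals() lookup becomes:
#     if func_to_call := TOOL_REGISTRY.get(func_name):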
Tweak the prompt

It can often be helpful to remind the LLM that it should attempt to use the tools you’ve provided when applicable. Add a line like this one to the end of your prompt:

Use the provided tools to gather additional context about the incident, if applicable.

Try it out

With your tools configured, you’ll be able to include PagerDuty links in your prompts, and the LLM will use those tools to gather context before answering.

🧑‍💻 Here's the complete code:

from datetime import datetime, timedelta
import json
import os
from pathlib import Path
from typing import override
import chainlit as cl

from openai import AsyncOpenAI, AsyncAssistantEventHandler
from openai.types.beta.threads import Message, TextDelta, Text
from openai.types.beta import AssistantStreamEvent
from openai.types.beta.threads.runs.tool_call import ToolCall
from openai.types.beta.threads import Run


##
# Settings
#
try:
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
except KeyError as ex:
    raise LookupError(f"Missing required environment variable: {ex}")

# Give your assistant a name.
OPENAI_ASSISTANT_NAME = "roger"

# Give your assistant some custom instructions.
OPENAI_ASSISTANT_INSTRUCTIONS = """
You are an expert Site Reliability Engineer, tasked with helping
the SRE team respond to and resolve incidents.

If you are presented with a question that does not seem like it
could be related to infrastructure, begin your response with a polite
reminder that your primary responsibilities are to help with incident
response, before fully answering the question to the best of your ability.

Use the provided tools to gather additional context about the incident, if
applicable.
"""


client = AsyncOpenAI(api_key=OPENAI_API_KEY)


class MessageEventHandler(AsyncAssistantEventHandler):
    """An event handler for updating a Chainlit message while streaming an OpenAI response."""

    message: cl.Message

    @override
    async def on_text_created(self, text: Text) -> None:
        """Create a new message so that we can update it."""
        self.message = cl.Message(content="")
        await self.message.send()

    @override
    async def on_text_delta(self, delta: TextDelta, snapshot: Text) -> None:
        """Update the message with the latest text delta streamed to us."""
        await self.message.stream_token(delta.value)

    @override
    async def on_message_done(self, message: Message) -> None:
        """Update the message with the final text when the stream completes."""
        await self.message.update()

    @override
    async def on_tool_call_created(self, tool_call: ToolCall) -> None:
        """Create a new step in the conversation to indicate that a tool is being used."""
        async with cl.Step(tool_call.type) as step:
            self.step = step

    @override
    async def on_event(self, event: AssistantStreamEvent) -> None:
        """Handle specific events."""
        if event.event == "thread.run.requires_action":
            run_id = event.data.id  # Retrieve the run ID from the event data
            self.current_run.id = run_id
            await self.handle_requires_action(event.data, run_id)

    async def handle_requires_action(self, run: Run, run_id: str) -> None:
        """Handle events that require an action to be taken, like tool calls."""
        tool_outputs = []

        # Execute each tool call and collect the result.
        for tool in run.required_action.submit_tool_outputs.tool_calls:
            func_name = tool.function.name
            func_args = tool.function.arguments

            # TODO: Build a function map with a decorator instead of looking up
            # functions in globals.
            if func_to_call := globals().get(func_name):
                try:
                    # Parse the func_args JSON string to a dictionary
                    tool_outputs.append(
                        {
                            "tool_call_id": tool.id,
                            "output": await func_to_call(**json.loads(func_args)),
                        }
                    )
                except TypeError as ex:
                    print(f"Error calling function {func_name!r}: {str(ex)}")
            else:
                print(f"Function {func_name!r} not found")

        # Submit tool outputs to the conversation thread.
        async with client.beta.threads.runs.submit_tool_outputs_stream(
            thread_id=self.current_run.thread_id,
            run_id=run_id,
            tool_outputs=tool_outputs,
            event_handler=MessageEventHandler(),
        ) as stream:
            await stream.until_done()


@cl.on_chat_start
async def handle_chat_start() -> None:
    vector_store = None

    # Try to find an existing vector store so we don't create duplicates.
    async for existing_vector_store in await client.beta.vector_stores.list():
        if existing_vector_store.name == OPENAI_ASSISTANT_NAME:
            vector_store = existing_vector_store
            break

    # Create a vector store if we didn't find an existing one.
    vector_store = vector_store or await client.beta.vector_stores.create(
        name=OPENAI_ASSISTANT_NAME,
    )

    if documents := list(Path("./docs").glob("**/*.md")):
        await client.beta.vector_stores.file_batches.upload_and_poll(
            vector_store_id=vector_store.id,
            files=(f.open("rb") for f in documents),
        )

    assistant = None

    # Try to find an existing assistant so we don't create duplicates.
    async for existing_assistant in await client.beta.assistants.list():
        if existing_assistant.name == OPENAI_ASSISTANT_NAME:
            assistant = existing_assistant
            break

    # Create an assistant if we didn't find an existing one.
    assistant = assistant or await client.beta.assistants.create(
        name=OPENAI_ASSISTANT_NAME,
        model="gpt-4o",
    )

    # Update the assistant so that it always has the latest instructions
    assistant = await client.beta.assistants.update(
        assistant_id=assistant.id,
        instructions=OPENAI_ASSISTANT_INSTRUCTIONS,
        tools=[
            # Our existing file search tool
            {"type": "file_search"},
            # Our new pagerduty alert tool
            {
                "type": "function",
                "function": {
                    "name": get_pagerduty_alert_details.__name__,
                    "description": get_pagerduty_alert_details.__doc__,
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "pagerduty_alert_url": {
                                "type": "string",
                                "format": "uri",
                                "description": "The PagerDuty alert URL. For example, 'https://example.pagerduty.com/alerts/Q3YDP8VKEZ9THL'.",
                            },
                        },
                        "required": ["pagerduty_alert_url"],
                        "additionalProperties": False,
                    },
                },
            },
        ],
        tool_resources={
            "file_search": {
                "vector_store_ids": [vector_store.id],
            }
        },
    )

    # Create a thread for the conversation
    thread = await client.beta.threads.create()

    # Add the assistant and the new thread to the user session so that
    # we can reference it in other handlers.
    cl.user_session.set("assistant", assistant)
    cl.user_session.set("thread", thread)


@cl.on_message
async def handle_message(message: cl.Message) -> None:
    # Retrieve our Assistant and Thread from our user session.
    assistant = cl.user_session.get("assistant")
    thread = cl.user_session.get("thread")

    # Add the latest message to the thread.
    await client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=message.content,
    )

    # Stream a response to the Thread (called a "Run") using our Assistant.
    async with client.beta.threads.runs.stream(
        assistant_id=assistant.id,
        thread_id=thread.id,
        # Use our custom message handler.
        event_handler=MessageEventHandler(),
    ) as stream:
        await stream.until_done()


async def get_pagerduty_alert_details(pagerduty_alert_url: str) -> str:
    """Return the details of the PagerDuty alert at the given URL."""
    # TODO: Make this implementation real!
    return json.dumps(
        {
            "alert": {
                "id": "PT4KHLK",
                "type": "alert",
                "summary": "A customer database is experiencing high CPU usage.",
                "self": pagerduty_alert_url,
                "html_url": pagerduty_alert_url,
                "created_at": (datetime.now() - timedelta(minutes=5)).isoformat(),
                "status": "resolved",
                "alert_key": "baf7cf21b1da41b4b0221008339ff357",
                "suppressed": False,
                "severity": "critical",
            }
        }
    )

Now you're all set to build a useful AI Agent for your SRE team! If you have any questions about anything we've covered in this guide, please reach out, and we'll be happy to help. In the meantime, if there's anything missing or any other AI Agent-related topic you'd like us to cover, let us know!

Don't want to build your own? Try Aptible AI.

© APTIBLE INC.
