AI

HACK DAY: Unleash the Power of AI Agents in Your Development Workflow

Jared Brook

5 Minute Read

Every six weeks, our global team gathers for Hack Day to collaborate, experiment and push the boundaries of software development.

For our most recent session, we turned our focus to agentic AI: self-directed agents powered by large language models (LLMs). We set out to discover how these autonomous assistants could streamline our workflows, improve observability and give us real-time feedback as we code, test and iterate.

To explore this, we mixed AI-first integrated development environments (IDEs), containerised environments, local LLMs, and modular tooling frameworks into a series of hands-on experiments. Our goal was simple: learn what works (and what doesn’t) so we can share actionable insights with our customers. Read on to see what we uncovered and how you can apply these lessons in your own projects.

 

Here are the key insights we uncovered:

1. Exploring AI-First IDEs 

Windsurf & Cursor vs. VS Code/Copilot 

Switching from traditional editors to AI-centric environments revealed three distinct workflows. 

We found that Windsurf ‘bounces’ around your project, surfacing context-aware snippets and letting you move between tasks without relying on a file tree. Cursor, by contrast, lives in your terminal and treats every step as a conversational session with your code. VS Code paired with Copilot integrates AI suggestions directly into your workflow, offering autocompletion and full-function drafts within the familiar file-and-tab interface. 

These AI-native IDEs proactively provide relevant context and accelerate exploration and prototyping.

Key takeaway: AI-native IDEs not only speed up coding but also deliver smarter, more precise suggestions and solutions exactly when you need them.

 

2. Building Agents with the Strands Agents SDK

Strands Agents is a lightweight, code-first SDK for crafting AI agents in just a few lines of code. In our experiments we used it to explore three core patterns (a minimal sketch follows the list):

Context Population from External Knowledge
    • Loaded instructions and reference docs at runtime
    • Queried vector or document stores to feed the agent relevant facts
Prompt Externalisation
    • Kept prompts in separate files or databases so they could be versioned and tuned independently of code
    • Injected dynamic variables (e.g. user IDs, timestamps) at execution time
Long-Term Memory via State Persistence
    • Persisted agent state (memory of past actions, variable values) to external stores (Redis, DynamoDB)
    • On each run, agents “remembered” what they’d already solved – skipping redundant steps and even suggesting missing tools when gaps appeared
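
To make these three patterns concrete, here’s a minimal sketch in the spirit of the Strands quickstart. The Agent and tool calls follow the SDK’s published quickstart pattern; the prompts/reviewer.txt template, the lookup_runbook stub, and the local JSON file standing in for Redis or DynamoDB are illustrative assumptions rather than our production setup.

```python
# Minimal sketch: externalised prompt, external knowledge via a tool, and
# long-term memory persisted outside the agent. Assumes a prompts/reviewer.txt
# template containing {user_id}, {timestamp} and {already_done} placeholders,
# and uses a local JSON file as a stand-in for Redis/DynamoDB.
import json
import pathlib
from datetime import datetime, timezone

from strands import Agent, tool  # pip install strands-agents

STATE_FILE = pathlib.Path("agent_state.json")

@tool
def lookup_runbook(topic: str) -> str:
    """Fetch reference material for a topic (stub for a vector/document store query)."""
    return f"Runbook entry for {topic}: ..."

def load_prompt(name: str, **variables: str) -> str:
    """Prompt externalisation: templates live on disk, variables are injected at run time."""
    template = pathlib.Path("prompts", f"{name}.txt").read_text()
    return template.format(**variables)

def load_state() -> dict:
    """Long-term memory: reload what previous runs already solved."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {"completed": []}

state = load_state()
agent = Agent(
    system_prompt=load_prompt(
        "reviewer",
        user_id="u-123",
        timestamp=datetime.now(timezone.utc).isoformat(),
        already_done=", ".join(state["completed"]) or "nothing yet",
    ),
    tools=[lookup_runbook],
)

print(agent("Review the deployment runbook and flag anything missing."))

state["completed"].append("runbook-review")
STATE_FILE.write_text(json.dumps(state, indent=2))
```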

Key takeaway: By decoupling your code from your prompts, context sources, and memory store, Strands makes it easy to iterate on each layer independently, driving faster prototyping and more reliable agent behaviour.

 

3. Running LLMs Locally with Ollama

To test on-premises LLM hosting, we containerised two models in Ollama: one heavyweight model without tool support (fast but limited) and a mid-sized model with tool integrations that still fell short of cloud-hosted application programming interfaces (APIs) like Claude and Bedrock.
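
For reference, talking to a locally hosted model is a single HTTP call against Ollama’s documented REST API. The model name below is a placeholder for whichever models you have pulled:

```python
# Minimal sketch: query a locally hosted model via Ollama's REST API, assuming
# Ollama is running on its default port (11434) and the model has been pulled.
import requests

def ask_local_model(model: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object rather than a token stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

print(ask_local_model("llama3", "Summarise the trade-offs of hosting LLMs locally."))
```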

Key takeaway: On-premises hosting is promising for privacy and latency, but quality, memory integration, and tool chaining still lagged managed services like Claude or Bedrock.

 

4. Custom MCP Server Setups

To isolate concerns and improve maintainability, we ran our Strands-based agents on two separate Model Context Protocol (MCP) servers (a minimal sketch of the tooling side follows the list):

  • Tooling MCP Server: Exposed and managed all external utilities – such as the AWS Command Line Interface (CLI), data-fetching APIs, and custom scripts
  • Client Logic MCP Server: Handled the agent’s internal workflow – receiving prompts, orchestrating memory persistence, formatting responses, and enforcing conversational state
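
To give a feel for the tooling side of this split, here’s a minimal sketch of an MCP server built with the official Python SDK’s FastMCP helper. The aws_cli and fetch_json tools are illustrative stand-ins for the utilities we actually exposed, and the read-only check previews the guardrail point below.

```python
# Minimal sketch of a "tooling" MCP server using the MCP Python SDK's FastMCP
# helper (pip install "mcp[cli]"). The tools are illustrative stand-ins for the
# AWS CLI wrappers, data-fetching APIs and custom scripts we exposed.
import subprocess
import urllib.request

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tooling-server")

@mcp.tool()
def aws_cli(args: str) -> str:
    """Run a read-only AWS CLI command (e.g. "ec2 describe-instances")."""
    parts = args.split()
    # Guardrails wired in code, not just documentation: block mutating calls.
    if len(parts) < 2 or not parts[1].startswith(("describe", "get", "list")):
        return "Blocked: only read-only AWS CLI operations are allowed."
    result = subprocess.run(["aws", *parts], capture_output=True, text=True, timeout=60)
    return result.stdout or result.stderr

@mcp.tool()
def fetch_json(url: str) -> str:
    """Fetch a JSON document for the agent to reason over."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    mcp.run(transport="stdio")  # the client-logic server connects to this over stdio
```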

This split architecture revealed two key insights:

  • Policy guardrails need explicit wiring. Declaring a “do not extract personal data” rule in documentation alone won’t enforce it – you must hook each MCP server into validation and error-handling pipelines that actively block forbidden operations
  • Environment shapes behaviour. When we ran the same agent image in a container with AWS CLI binaries versus one with only networking tools, the agent introspected its runtime (using commands like which) and adapted its execution plan, sometimes skipping steps it believed weren’t feasible (a simple version of this check is sketched below)
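
Here’s a simple version of that introspection check in plain Python (the agent did the equivalent itself with shell commands like which):

```python
# Check which CLIs actually exist in the container and adjust the plan.
import shutil

required = {"aws": "query AWS resources", "curl": "fetch external data"}
available = {name: shutil.which(name) is not None for name in required}

planned = [task for name, task in required.items() if available[name]]
skipped = [name for name, ok in available.items() if not ok]

print("Planned steps:", planned)
print("Skipped (missing tools):", skipped)
```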

Key takeaway: A container’s toolset is part of your agent’s “brain.” By curating exactly which CLI tools and libraries are available in each MCP environment, you guide agents toward safe, reliable behaviours.

 

5. Deep-Dive Experiments

To challenge our agents across critical workflows, we designed a set of deep-dive experiments that each focused on a specific real-world task. Here’s what we tried:

Documentation scraping:

We pulled down reference manuals from the web, then quizzed our LLM on their contents, finding accuracy improved dramatically when we upgraded to a more capable model.
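
A stripped-down version of that experiment looks like this; the URL is a placeholder, and the final model call is left as a comment since it depends on which LLM you wire in:

```python
# Minimal sketch: pull a reference page, reduce it to plain text, and hand it
# to the model as context. The URL and ask_model() are placeholders.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def scrape_docs(url: str) -> str:
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

docs = scrape_docs("https://example.com/reference-manual")
question = "According to these docs, what is the default retry limit?"
prompt = f"Answer using only the documentation below.\n\n{docs[:8000]}\n\nQ: {question}"
# answer = ask_model(prompt)  # hypothetical call into whichever LLM you use
```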

Diagram generation: 

By chaining two MCP endpoints (one for GitHub auth, another for drawing), the agent generated and executed Python code to produce architecture diagrams on demand.
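
The code the agent produced looked much like a standard script for the open-source diagrams library (which needs Graphviz installed). This cut-down example shows the general shape, with an architecture invented purely for illustration:

```python
# Illustrative example of agent-generated diagram code using the `diagrams`
# library (pip install diagrams; requires Graphviz). The architecture is made up.
from diagrams import Diagram
from diagrams.aws.compute import ECS
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

with Diagram("Service Architecture", show=False, filename="service_architecture"):
    ELB("load balancer") >> ECS("api service") >> RDS("postgres")
```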

Policy automation:

We provided the agent with the AWS CLI plugin for Identity and Access Management (IAM). Once it had the right tooling, the agent could autonomously draft, validate and refine IAM policies. When we deliberately broke the plugin and then restored it, the agent detected the failure, retried the operation and successfully completed the policy generation, demonstrating its ability to recover from errors without manual intervention.
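
The draft-and-validate loop at the heart of that experiment can be approximated with boto3 and IAM Access Analyzer’s ValidatePolicy API. The policy below is just an example, and AWS credentials with Access Analyzer permissions are assumed:

```python
# Rough sketch: validate a draft IAM policy and surface findings for revision.
import json
import boto3

draft_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::demo-bucket/*"}
    ],
}

analyzer = boto3.client("accessanalyzer")
result = analyzer.validate_policy(
    policyDocument=json.dumps(draft_policy),
    policyType="IDENTITY_POLICY",
)

for finding in result["findings"]:
    # In our experiment the agent consumed findings like these and revised its draft.
    print(finding["findingType"], "-", finding["findingDetails"])
```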

Slack integration: 

A local Slack-bot agent iteratively tried CLI commands, parsed error messages, self-corrected, and then posted root-cause analyses back to the channel, demonstrating feedback loops that were missing in less capable models.
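
A trimmed-down version of that loop, using slack_sdk: run a command, capture the error, and post a summary back to the channel. The bot token, channel name and summarise() helper are placeholders; in the experiment the analysis came from the LLM itself.

```python
# Sketch of the feedback loop: try a CLI command and report failures to Slack.
import os
import subprocess

from slack_sdk import WebClient  # pip install slack_sdk

def summarise(stderr: str) -> str:
    # Placeholder: the agent asked an LLM for a proper root-cause analysis here.
    first_line = stderr.splitlines()[0] if stderr else "unknown error"
    return f"Command failed; first error line: {first_line}"

result = subprocess.run(["aws", "s3", "ls"], capture_output=True, text=True)

if result.returncode != 0:
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    client.chat_postMessage(channel="#hackday-agents", text=summarise(result.stderr))
```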

Key takeaway: Upgrading to more capable LLMs and chaining modular MCP services unlocks fully automated workflows. Supplying the right CLI tooling and embedding agents in platforms like Slack lets them recover from errors and provide real-time analysis without manual intervention.

 

Best Practices & Recommendations

By combining AI-first IDEs, containerised sandboxes, local LLMs and modular tooling in hands-on experiments, we watched where agents excelled and where they tripped up. Here are the distilled insights we’ve collected for you, so you can skip our missteps and get straight to what works:

  • Seed a clear persona: Instruct your agent to surface missing tools or “voice” suggestions when it spots gaps.
  • Template your logs: Provide a schema for logging to files, tables, or APIs to keep outputs consistent (a minimal example follows this list).
  • Balance context and noise: Too much documentation overwhelms; too little, and hallucinations ensue.
  • Leverage containers as config: Include only the CLIs and binaries your agent truly needs.
  • Combine managed and local models: Use hosted LLMs for core reasoning and local models for sensitive, high-throughput workloads.
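
As an example of the log-templating point, here’s a minimal schema where every agent action is written as one JSON line, so files, tables and APIs all receive the same shape (the field names are illustrative):

```python
# Minimal log template: one JSON object per agent action, same fields everywhere.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent")

def log_action(agent: str, action: str, status: str, detail: str = "") -> None:
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "action": action,
        "status": status,
        "detail": detail,
    }))

log_action("policy-bot", "validate_iam_policy", "retried", "plugin unavailable, retrying")
```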

Tooling for Your Teams 

Based on our exploration, we recommend using the following tools for your teams:

  • Windsurf or Cursor for AI-first coding
  • Strands paired with modular MCP servers for flexible agent pipelines
  • Ollama for local LLM hosting, provided you pair it with robust memory and tool support
  • OpenAI Guardrails (or similar frameworks) to enforce policy and privacy
  • Container-driven deployments to precisely control each agent’s environment

What Have We Learnt?

Our Hack Day experiments showed that hosting models on premises - using local LLMs in containerised environments - provides unmatched privacy and low latency. This makes it the best option for sensitive or time-critical tasks. However, it still falls behind managed services such as Claude and Bedrock in terms of model sophistication, memory management, and smooth chaining of tools.

That said, local hosting shines during development because it is affordable, safe, and fast - especially when combined with modular tooling frameworks and AI-first IDEs like Windsurf and Cursor - even though it does not offer every advanced feature or the peak performance available from cloud APIs.

When planning your own AI agent deployments, reflect on what matters most. If privacy or latency is critical, choose on-premises hosting. If you need the deepest capabilities and closest integration, rely on cloud-hosted models - ideally paired with flexible agent architectures and smart development environments to get the best of both worlds.

Inspired by Hack Day? Let base2Services show you how to integrate AI agents into your development workflow. Contact us today to schedule a tailored consultation!


