Hack Day: Teaching Our Tools to Think for Themselves

Written by Jared Brook | Jun 25, 2026 6:31:38 AM

Our Hack Days have covered a lot of AI ground over the past year - agentic IDEs, agent-to-agent protocols, local LLMs. But most of that work was about using AI tools as they ship. This time, the optional theme was "agent skills", and the question shifted from "What can these tools do out of the box?" to "What happens when we teach them to do things they can't do yet?"

The concept behind skills is straightforward. A skill is a folder containing a SKILL.md file - structured instructions that tell an AI agent how to perform a specific task it wouldn't otherwise know how to do. Think of them as runbooks for an LLM: instead of explaining the same multi-step workflow every time, you encode it once and the agent follows it on demand. The format originated at Anthropic and has since been adopted as an open standard across a growing number of tools - Cursor, Claude Code, Gemini CLI, GitHub Copilot, VS Code, and many others.
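To make the folder-plus-SKILL.md idea concrete, here is a minimal sketch in Python that writes a skill folder and reads back its metadata. The skill name, description and steps are invented for illustration, and the tiny `key: value` parser is a stand-in for a real YAML parser rather than how any particular agent reads the file.

```python
from pathlib import Path

# Hypothetical skill: the name, description and steps are illustrative only.
SKILL_MD = """\
---
name: incident-summary
description: Draft a post-incident review from a chat thread. Use when asked for a PIR.
---

# Incident summary

1. Ask which channel and time window to cover.
2. Collect the relevant messages.
3. Draft the review using the team's template.
"""


def write_skill(root: Path, folder: str, content: str) -> Path:
    """Create <root>/<folder>/SKILL.md and return its path."""
    skill_dir = root / folder
    skill_dir.mkdir(parents=True, exist_ok=True)
    path = skill_dir / "SKILL.md"
    path.write_text(content, encoding="utf-8")
    return path


def read_frontmatter(path: Path) -> dict:
    """Parse the simple `key: value` lines between the two `---` markers."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta
```

The description line does double duty: it is what a person reads to understand the skill, and it is what the agent uses to decide whether the skill applies.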

Building Skills That Actually Work

Creating useful agent skills from scratch came down to three practical questions: how should the agent behave, where should it get context from, and how can skills be triggered reliably?

One experiment started with a question most of us have probably had. What if the agent pushed back when you asked it something you could easily look up yourself? The result was a "Google-it" skill for Cursor - a set of instructions that detects when a user is offloading a simple or trivial question to the agent and challenges them to find the answer on their own first. It's a small, slightly cheeky idea, but it touches on a real dynamic. As these tools get more capable, the temptation is to use them as a substitute for thinking rather than a supplement. The skill worked as a proof of concept, and the broader takeaway was that a well-defined skill can change an agent’s behaviour through structured instructions, without needing a complex plugin or large codebase.

Slack offered a clear test case for a context-aware skill - one that could pull messages from Slack channels and threads to give an agent better context for tasks like writing post-incident reviews, summarising investigations, or generating status updates. The idea is that a lot of useful context lives in Slack conversations that never makes it into documentation or tickets. If an agent can access that context directly, the quality of its output improves significantly. This was more exploratory than finished product, but it validated the approach of using skills to bridge the gap between where information lives and where it's needed.
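One piece of a skill like this is turning raw messages into something an agent can read. The sketch below assumes the messages have already been fetched and arrive as Slack-style dictionaries with `ts`, `user`, `text` and an optional `thread_ts`. The `format_thread` helper and the `user_names` lookup are hypothetical names for illustration, not part of the Hack Day project.

```python
from datetime import datetime, timezone
from typing import Optional


def format_thread(messages: list, user_names: Optional[dict] = None) -> str:
    """Turn Slack-style message dicts into a chronological plain-text transcript."""
    user_names = user_names or {}
    ordered = sorted(messages, key=lambda m: float(m["ts"]))
    lines = []
    for msg in ordered:
        when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        user_id = msg.get("user", "unknown")
        who = user_names.get(user_id, user_id)
        # A reply carries a thread_ts that differs from its own ts; indent it.
        is_reply = msg.get("thread_ts") not in (None, msg["ts"])
        prefix = "    " if is_reply else ""
        lines.append(f"{prefix}[{when:%Y-%m-%d %H:%M}Z] {who}: {msg.get('text', '')}")
    return "\n".join(lines)
```

Keeping the transcript compact and chronological matters more than it looks: the agent only benefits from Slack context if the thread structure survives the trip into the prompt.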

The skill system itself raised another important question - how skills are discovered, triggered, and composed across different agents. The focus was on understanding the mechanics: how an agent decides which skill to invoke based on the description metadata, how context flows between skills through progressive disclosure, and where the edges are. Since skills are portable across tools that support the format, understanding these mechanics once means you can author skills that work in Cursor, Claude Code, or any other compatible agent. This kind of foundational work is less flashy than building a specific tool, but it pays off the next time someone on the team needs to build a skill that actually works reliably.
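Progressive disclosure can be sketched in a few lines: every skill's description is always visible, but the full instructions are loaded only for the skill that gets picked. In the toy version below, `pick_skill` uses word overlap purely as a stand-in for the model's own judgement. In a real agent the LLM makes that choice, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SkillEntry:
    name: str
    description: str               # always visible to the agent
    load_body: Callable[[], str]   # only called once the skill is chosen


def pick_skill(request: str, skills: list) -> Optional[SkillEntry]:
    """Toy stand-in for the model's choice: most words shared with the description."""
    words = set(request.lower().split())
    best, best_score = None, 0
    for skill in skills:
        score = len(words & set(skill.description.lower().split()))
        if score > best_score:
            best, best_score = skill, score
    return best


def build_context(request: str, skills: list) -> str:
    """Stage 1 lists every description; stage 2 adds the body of one skill only."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in skills)
    chosen = pick_skill(request, skills)
    body = chosen.load_body() if chosen else ""
    return f"Available skills:\n{index}\n\n{body}".strip()
```

The sketch also shows why the description matters so much for reliable triggering: if it doesn't mention the situations in which the skill should fire, the body is never loaded at all.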

Running LLMs Without the Cloud

Running models locally raised a different kind of question. When does local inference become practical, and where do hardware limits still get in the way?

A self-correcting coding agent was a useful way to test that question on local hardware using Ollama. The design was a Python harness that Cursor invokes as a skill. When triggered, it sends a task description and test file to a local Ollama model, extracts the code from the response, writes it to disk, and runs the test suite. If tests fail, the errors go back to the model and it tries again - looping until either everything passes or it runs out of retries. The system is stateless between retries by design, resetting to the system prompt plus latest code plus errors each time, which avoids context window bloat on smaller models.
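The loop described above can be sketched as follows. This is not the Hack Day harness itself: `generate` is a placeholder for whatever function calls the local model, the test file is assumed to be a plain script that exits non-zero on failure, and the prompt layout (task and tests repeated on every attempt) is an assumption.

```python
import re
import subprocess
import sys
from pathlib import Path
from typing import Callable

FENCE = "`" * 3  # triple backtick, built this way so it can appear inside a fenced example
CODE_RE = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced block in the response, or the raw text if none."""
    match = CODE_RE.search(response)
    return (match.group(1) if match else response).strip() + "\n"


def run_tests(test_file: Path) -> tuple:
    """Run the test file as a script; a non-zero exit code means failure."""
    proc = subprocess.run(
        # -B skips bytecode caching, so a rewritten module is never read from a stale .pyc
        [sys.executable, "-B", test_file.name],
        cwd=test_file.parent, capture_output=True, text=True, timeout=120,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def solve(task: str, test_file: Path, target: Path,
          generate: Callable[[str], str], max_retries: int = 3) -> bool:
    """Generate, test and retry. Each prompt is rebuilt from scratch (stateless)."""
    system = "You write Python. Reply with one fenced code block only."
    tests = test_file.read_text(encoding="utf-8")
    code, errors = "", ""
    for _ in range(max_retries):
        prompt = f"{system}\n\nTask:\n{task}\n\nTests:\n{tests}"
        if code:
            # Only the latest code and errors are carried over, never the full history.
            prompt += f"\n\nLatest code:\n{code}\nTest errors:\n{errors}"
        code = extract_code(generate(prompt))
        target.write_text(code, encoding="utf-8")
        passed, errors = run_tests(test_file)
        if passed:
            return True
    return False
```

Leaving `generate` as a parameter means the loop can be exercised with a fake model before any real inference is involved, which is handy when each real response takes minutes.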

Testing revealed a clear capability gradient. The 8B parameter Gemma model was comparatively fast, at three to four minutes per response, but struggled with anything beyond straightforward tasks. It managed 16 out of 23 tests on its best attempt but couldn't handle edge cases that required it to avoid specific built-in functions. The 26B parameter model was too large for the available GPU and timed out during loading. The mechanical loop itself worked well, and the model genuinely improved across iterations, but model capability was the bottleneck. The practical conclusion is that local models are viable for constrained tasks with clear test coverage, but you need to match the model size to the hardware you have and the complexity of the problem you're solving.

Pairing Ollama with OpenClaw offered a second look at the same challenge from an orchestration perspective. The setup worked for basic use cases, but adding WhatsApp integration and the larger Gemma 4 image hit a wall: the model required 21 GB of memory on a machine with only 20.9 GB available. It's the kind of frustrating near-miss that's common with local LLM work right now, where the gap between "almost fits" and "runs reliably" is still significant.

Smarter Alert Routing with EventBridge

Not everything was about AI. A practical infrastructure problem also came into focus: the gap between the events our health dashboard generates and the alerts our support team actually needs to see. The current pipeline pushes events through Amazon EventBridge into our alerting platform, but the filtering available at that boundary is coarse - you can match on event type, but not much else. If a non-critical resource generates the same event type as a critical one, both land in the same alert queue.

The solution was an event parser - an intermediate step between Amazon EventBridge and our alerting platform that adds fine-grained filtering logic. With this in place, alerts can be routed based on specific resource tags, severity combinations, or custom rules without changing the upstream event structure. It's not a large architectural change, but it directly reduces noise for the people on call and gives us more control over what warrants a page versus what can wait for business hours.
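As a rough illustration of the filtering step, the sketch below routes an event to "page", "business_hours" or "drop" using the first rule that matches. The `detail-type` and `detail` keys follow the standard EventBridge envelope, but the `severity` and `tags` fields inside `detail`, the `Rule` class and the action names are all assumptions rather than the parser that was built.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule: every condition that is set must match."""
    action: str                                   # "page", "business_hours" or "drop"
    event_types: set = field(default_factory=set)
    severities: set = field(default_factory=set)
    tags: dict = field(default_factory=dict)

    def matches(self, event: dict) -> bool:
        detail = event.get("detail", {})
        if self.event_types and event.get("detail-type") not in self.event_types:
            return False
        if self.severities and detail.get("severity") not in self.severities:
            return False
        resource_tags = detail.get("tags", {})
        return all(resource_tags.get(k) == v for k, v in self.tags.items())


def route(event: dict, rules: list, default: str = "business_hours") -> str:
    """Return the action of the first matching rule, or the default."""
    for rule in rules:
        if rule.matches(event):
            return rule.action
    return default
```

With rules like `Rule("page", severities={"critical"}, tags={"tier": "critical"})`, two events of the same type can end up in different queues depending on which resource raised them, which is exactly the distinction the coarse filter could not make.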

Mounting S3 as a Lambda File System

A recently launched AWS capability was also worth testing: S3 Files, which lets you mount an S3 bucket as a local file system inside a Lambda function. The appeal is obvious - Lambda functions that need to read or write files currently have to go through the S3 API, which adds complexity and latency for workloads that are fundamentally file-oriented. With S3 Files, the function just reads from and writes to a local path.

The setup involved creating a VPC-attached Lambda with a versioning-enabled S3 bucket, configuring the appropriate IAM permissions for s3files:ClientMount and s3files:ClientWrite, and pointing the mount at the right access point folder. It's the kind of feature that won't change how most Lambda functions work, but for the subset that deal heavily with file I/O (data processing, report generation, configuration management) it simplifies the code considerably.
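From the function's point of view, the result looks something like the sketch below: ordinary file operations against a local path, with no S3 client in sight. The mount location, the `S3_MOUNT_PATH` environment variable and the event fields are assumptions for illustration. The real path depends on how the mount is configured for the function.

```python
import json
import os
from pathlib import Path


def mount_path() -> Path:
    """Assumed mount location; override it with the (hypothetical) S3_MOUNT_PATH variable."""
    return Path(os.environ.get("S3_MOUNT_PATH", "/mnt/s3"))


def handler(event, context):
    """Write a small report to the mounted bucket using plain file I/O."""
    report_dir = mount_path() / "reports"
    report_dir.mkdir(parents=True, exist_ok=True)
    report = report_dir / f"{event['report_id']}.json"
    report.write_text(json.dumps(event.get("payload", {})), encoding="utf-8")
    return {"path": str(report), "bytes": report.stat().st_size}
```

Because the handler only touches a path, the same code can be run locally against a temporary directory, which makes file-heavy functions easier to test.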

What We Took Away

The day reinforced something we've been noticing across recent Hack Days: the gap between "using AI tools" and "extending AI tools" is closing fast. Building a skill that meaningfully changes an agent's behaviour took hours, not weeks. The infrastructure for local models is close to practical for constrained use cases. And the ability to compose skills - chaining a Slack integration with a summarisation task, or a Trello lookup with a report generator - means the return on each individual skill compounds as the library grows.

The projects that worked best shared a common trait - tight scope with clear success criteria. The coding agent had a test suite. The event parser had a specific filtering gap to close. The skills had defined trigger conditions and expected outputs. When the scope was fuzzy - "explore this tool" - the results were useful for learning but harder to turn into something the team could use the next day.

That's the rhythm we're settling into with these Hack Days. Each one produces a few things that go straight into production, a few that need another iteration, and a few that teach us something we didn't know we needed to learn. The skills work in particular feels like it has legs. The format is standardised and portable across a growing list of tools, and the investment in writing a good skill pays off every time it's invoked - by any team member, in any compatible agent.

Want to explore how agent skills or local LLM workflows could fit into your team's development process? Get in touch - we're always happy to share what we've learned.
