Agents on Kevin Quinn

Thu, 30 Apr 2026 05:00:00 +0000

TL;DR: what if PR review was as engaging & low-friction as TikTok?

We’ve had a lot of innovation in AI agents producing more code, but I haven’t run across much innovation in how we do code review/PR review. The pace of new code is increasing, but human reviewers are often still the bottleneck.

The knee-jerk reaction I have to get ahead of is “jUsT hAVe AI aGEnTs REviEw IT” - yes, that can be a tool which helps us, but it’s not a solution that covers us 100%. For one, many organizations still require a human-in-the-loop and 1-2 human reviewers for each PR, and we have to work with that constraint until Dark Factories become the norm (big maybe there).

I’m starting to explore some of these ideas in my research repo. Feel free to take a look at the interactive demos, or read the context page for the distilled summary of my conversation with Claude about it.

👀 https://research.idontremember.workers.dev/tiktok-pr-reviews/

My approach to Agent Skills: Occam's Razor

Tue, 21 Apr 2026 05:00:00 +0000

My current approach to Agent Skills is to not include them unless I feel the pain without them. The majority of the time I’m running 1-3 instances of vanilla Claude Code.

To be clear, I’m not against Skills usage. I pull in Skills that have clear value, like helping agents use specific tools more effectively (e.g. playwright-cli Skill). My only holdup with AI Augments¹ is that so few provide proof that they are better than using vanilla Claude Code.

Every Skill looks the same on paper

So far I’ve found many Skills to be a lot of hype, but not a lot to back up their claims. Without clear measures, how am I supposed to distinguish between your amazing Skill you put a ton of effort into refining, and one that ostensibly does the same thing, but was one-shot prompted by a 12-year-old trying to promote their YouTube channel?

To be fair, many Skills help with activities which are genuinely hard to measure. But it feels like if they had anything at all to show they’re better than other Skills (or vanilla Claude Code), they’d be screaming it from the rooftops and you wouldn’t be able to miss it. As it stands, even for a popular suite of skills like Superpowers (“An agentic skills framework & software development methodology that works”), I struggle to find any concrete measures of WHY it’s better than plain vanilla tools. Ditto for basically every other Skill I’ve considered pulling in. If we can’t actually measure a Skill’s quality, are we just fooling ourselves by tweaking the tokens we use to arrive at the same place?

It feels a bit like we’re following the PKM trend, where people build up incredibly complex Obsidian/Notion workspaces to make them more productive, and at the end of it… they have a lot of artifacts of work, but not a lot of outcomes.

Long-term hypothesis: vanilla wins

I also subscribe to the hypothesis that the best Skills will eventually be eaten by the agent coding tools (like Claude Code), so I’ll end up getting the useful ones anyway. I could be wrong, but I believe we’ve already seen this happen with Claude Code with features like auto-memory, code execution, file tools like XLSX/PDF, scheduling routines, cloud agent runs, browser automation, as well as pre-built Skills which are bundled in without a user having to add them.

The Skills ecosystem is such a great vetting ground for the teams at Anthropic, OpenAI, etc. to watch which ones are rising to the top in popularity and then figure out how to bake them into the core product. It’s easier to acquire someone else’s innovation than to have all the ideas yourself.

¹ Skills, commands, whatever the next hotness is.

Thu, 16 Apr 2026 05:00:00 +0000

Can we make agents more reliable with better deterministic tooling?

Thu, 09 Apr 2026 05:00:00 +0000

I look forward to when tooling around AI agents has standardized to solve the common frustrations. One of those is trusting that what the agent actually creates matches the plan we refined together.

The features the agent & I are building get loaded into my mental model of the codebase, and each time I hit a situation where it only applied a concept to half the places it should have, I lose trust and feel like I have to spend even more time at each stage verifying every little piece. It’s the difference between handing a task to your competent coworker vs. giving the task to that one architect you wouldn’t trust to install a CLI tool by themselves. One of them gets an LGTM, the other I will dig through with a fine-toothed comb.
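To make the "only applied it to half the places" problem concrete, here is a minimal sketch of the kind of deterministic check I have in mind. It is not an existing tool: the function names, the idea of a plan listing the files a change must touch, and the `with_retry(` marker are all hypothetical, and a plain substring match is a deliberately naive stand-in for a real check.

```python
def missing_applications(marker, files):
    """Return the paths whose contents do not contain the planned marker.

    `marker` is a hypothetical string the change should leave behind, and
    `files` maps path -> file contents, so the check itself stays pure.
    """
    return sorted(path for path, text in files.items() if marker not in text)


def coverage_report(description, marker, files):
    """Summarize how many of the planned files actually received the change."""
    missing = missing_applications(marker, files)
    applied = len(files) - len(missing)
    status = "OK" if not missing else "INCOMPLETE"
    return f"{status}: {description} applied to {applied}/{len(files)} files; missing: {missing}"


if __name__ == "__main__":
    # Hypothetical plan item: every handler should be wrapped in with_retry().
    planned_files = {
        "api/orders.py": "handler = with_retry(create_order)",
        "api/users.py": "handler = create_user",
    }
    print(coverage_report("wrap handlers in with_retry", "with_retry(", planned_files))
```

The point is less the string matching and more the shape: the plan names the places a concept must land, and something that isn't the agent confirms it landed in all of them before I ever look at the diff.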

While there are a bazillion approaches people might say would have solved this for me (cue the BMAD, Speckit, etc. crowd), I believe software engineers will eventually settle on a standard toolkit, in the same way most people use npm. There are other options to provide competition, but we aren’t all building our own package managers.