HackNight Valencia · 2026
Alejandro Vidal · @dobleio
Ladybird adopts Rust, with help from AI https://t.co/MBXhpkWdHI
— Ladybird (@ladybirdbrowser) February 23, 2026
The result was about 25,000 lines of Rust, and the entire port took about two weeks. The same work would have taken me multiple months to do by hand.
We've verified that every AST produced by the Rust parser is identical to the C++ one, and all bytecode generated by the Rust compiler is identical. Zero regressions across the board.
We rebuilt Next.js in a week. No, really.
— Cloudflare (@Cloudflare) February 24, 2026
The team ported the framework to run natively on Workers to prove what's possible with edge-first architecture. Dive into the technical hurdles we solved to eliminate Node.js dependencies. https://t.co/GqYBiZ5Qum
A clean reimplementation, not merely a wrapper. This time we did it in under a week. One engineer directing AI.
Why this problem is made for AI
We started with toy programming and vibe coding - fun for exploration. But the reality: we need to scale software engineering.
A lot of people quote tweeted this as 1 year anniversary of vibe coding. Some retrospective -
— Andrej Karpathy (@karpathy) February 4, 2026
I've had a Twitter account for 17 years now (omg) and I still can't predict my tweet engagement basically at all. This was a shower of thoughts throwaway tweet that I just fired off… https://t.co/yoJPmb1xuK
Today (1 year later), programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny. The goal is to claim the leverage from the use of agents but without any compromise on the quality of the software. [...] "agentic engineering"
Don't FOMO - work and practice.
Reasoning from first principles about how agentic engineering should look and how to scale teams accordingly.
What the human meant ≈ what the agent built
Agentic time >>> human time - so we are the bottleneck!
How do we find the optimal workflow that maximizes the utility of the dev team time?
utility ≈ user value, development work units, ...
Sometimes agents don't finish the task so we need to "push" them. What if we make a loop for it?
You heard it here a few months before it got popular!
MAX_ATTEMPTS=10
attempt=0
while true; do
  llm "$prompt"              # your agent CLI of choice
  attempt=$((attempt + 1))
  # done_criteria: any script that exits 0 when the task is complete
  if done_criteria || [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then break; fi
done
while grep -q "\[ \]" TASKS.md; do
  claude -p "Take the first [ ] task from TASKS.md.
    Use scratchpad.md to keep notes.
    Implement, test, mark [x] and commit.
    If stuck, mark [E]."
done
for i in $(seq 1 $MAX); do
  claude -p "Implement TASK.md"
  claude -p "Review changes, run tests.
    Write issues to FIXES.md
    or empty it if all good."
  [ ! -s FIXES.md ] && break
done
This became so popular that Anthropic released an official skill:
/ralph-loop "Build a REST API for todos. Requirements: CRUD, validation, tests.
Output <promise>COMPLETE</promise> when done." --completion-promise "COMPLETE" --max-iterations 50
github.com/anthropics/claude-plugins-official/.../ralph-loop
👍 Great for prototyping & tasks that don't need tight steering. 👎 Burns tokens - and without guardrails, chaotic amplification kicks in (more on this later).
Zooming in on Variant 1 · See also: OpenAI Codex Execution Plans
Give the agent a living document to maintain its own state. It survives compactions - when the context window fills up and gets summarized, the file on disk retains the full picture. Each new session reads it first.
## Progress
- [x] Set up DB schema
- [x] CRUD endpoints
- [/] Add validation
- [ ] Error handling
- [B] Waiting for API keys
- [-] XML export (descoped)
## Decisions
Using zod for validation.
Auth middleware needed for /admin.
[x] done [B] blocked [/] working [-] canceled
## Purpose / Big Picture
User-visible behavior enabled.
## Progress ← checkbox-tracked, timestamped
- [x] 08:12 Scaffold routes
- [/] Wire validation middleware
- [B] Needs Redis credentials
## Surprises & Discoveries ← with evidence
DB returns null on empty join.
## Decision Log ← rationale + date
02-25 Use zod over joi: smaller bundle.
Every closed PR contains a lesson.
The Cost of Failure → Zero
Launch multiple attempts in parallel. Increases the probability of finding a good solution and reduces interruptions.
Before AI: trying a major approach and failing = months. Now: hours or days. You have a near-infinite innovation budget - use it.
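The fan-out can be sketched in a few lines. This is a hedged sketch, not production code: `run_agent` is a hypothetical wrapper around whatever agent CLI you use (e.g. `claude -p`), and in practice each attempt would run inside its own git worktree or branch.

```python
# Sketch: fan out N independent attempts at the same task, keep every result.
# `run_agent` is a hypothetical wrapper around your agent CLI.
from concurrent.futures import ThreadPoolExecutor

def run_agent(prompt: str, workdir: str) -> str:
    # Placeholder: in practice, spawn the agent CLI inside an isolated
    # worktree/branch so attempts can't clobber each other.
    return f"[{workdir}] result for: {prompt}"

def parallel_attempts(prompt: str, n: int = 3) -> list[str]:
    workdirs = [f"attempt-{i}" for i in range(n)]  # one worktree per attempt
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(run_agent, [prompt] * n, workdirs))

results = parallel_attempts("Implement rate limiting", n=3)
```

Review all N results, keep the best branch, and feed the failures into docs/learnings/.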
Real prompt sent to a remote Codex agent
The critical part is choosing the state synchronization stack. Implement a single game using N backends with different synchronization mechanisms.
For each synchronization library (list, but look for others). Requirement: self-hostable.
Run a Playwright test suite to verify everything works. Update wip.md as working memory. For each one, write a document covering pros, cons, DX, maturity. And a final summary with your conclusion.
Engineering is contextual. We interact with an environment - and the environment is both the problem and the solution.
Tests, linters, types, LSPs, documentation - they're not just for the human. They're the feedback loops that keep the agent aligned with reality.
The basics: linting, testing, and type checking. Agents need the same guardrails as devs - just automated.
Tools like opencode, Claude Code, and others already have native LSP connections. The agent gets real-time diagnostics, go-to-definition, and type info - for free.
ast-grep
Linting catches syntax. Tests catch behavior. But what about structural patterns specific to your project?
# rule: no-raw-sql.yml
id: no-raw-sql-in-handlers
language: python
severity: warning
rule:
  pattern: cursor.execute($SQL)
  inside:
    kind: function_definition
    stopBy: end
message: "Use the query builder, not raw SQL"
Think of it as teaching your codebase to reject bad patterns automatically.
More rules at ast-grep.github.io/catalog
Add rules to the repo + run as a git hook - agents and humans both benefit.
They can see, but they won't catch a 2px misalignment or a subtle color shift. Feed them the diffs - and let them zoom in.
# Perceptual diffs
Every time you change anything in ui/:
1. Run:
npx playwright test --update-snapshots
git diff --name-only "*.png"
2. If changed snapshots exist, open
each one and describe what changed.
3. Zoom into the affected regions.
# Multimodal models are more
# accurate at higher zoom levels.
Add this to your system prompt or AGENTS.md. The agent will self-check its own UI changes.
Highly-dense documentation is one of the best alignment mechanisms we have. It captures what code alone cannot:
But models tend to shortcut. They solve the task and skip updating the docs.
And a single giant AGENTS.md doesn't work well either - too much context, low signal-to-noise ratio.
Instead of one monolithic file, maintain a well-organized docs/ tree that agents can navigate:
docs/
├── architecture.md # system overview, boundaries
├── design-principles.md # core invariants
├── design-system.md # UI patterns, tokens, components
├── reference/
│ ├── api.md # endpoint contracts
│ └── data-model.md # schemas, relations
├── prds/
│ ├── auth-v2.md # product requirements
│ └── billing-reform.md
├── learnings/
│ ├── why-not-graphql.md # dead ends with rationale
│ └── redis-pitfalls.md
└── security/
├── auth.md # authn/authz model
└── threat-model.md
Split by concern so the agent doesn't read the entire docs folder every time. Each file is small, focused, and independently loadable.
Define whatever structure fits your challenges. This is a continuous learning thing, not a one-shot design. Don't trust silver bullets on the internet.
But this creates a new problem: how does the agent know which docs are relevant to the current task?
This is a classification problem: "Is this file relevant?" - and you need high recall. A miss means the agent works without critical context, worsening the situation. The same problem happens with skills.
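A minimal sketch of the recall-biased routing, using keyword overlap as a stand-in for the real classifier (in practice an LLM call or embeddings). The file names and keyword sets below are illustrative, not from a real repo:

```python
# Sketch: recall-biased routing of docs/ files to a task description.
# Keyword overlap is a cheap proxy; names and keywords are illustrative.
DOC_KEYWORDS = {
    "docs/security/auth.md": {"auth", "login", "jwt", "session", "rbac"},
    "docs/reference/data-model.md": {"schema", "subscription", "billing"},
    "docs/design-system.md": {"ui", "component", "button", "token"},
}

def relevant_docs(task: str, threshold: int = 1) -> list[str]:
    words = set(task.lower().split())
    # Low threshold on purpose: a false positive costs a few tokens,
    # a false negative means the agent works without critical context.
    return [doc for doc, kws in DOC_KEYWORDS.items()
            if len(words & kws) >= threshold]

docs = relevant_docs("Add JWT session refresh to the login flow")
```

Tune the threshold toward recall: over-including a doc is cheap, missing one is not.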
Docs and code explicitly reference each other using a flexible linking syntax.
In large projects, agents don't scale in code understanding as well as we'd like. They rely on heuristics - grep, common naming conventions - to find relevant parts. This is the main reason behind agents duplicating the same thing N times instead of reusing what already exists.
@linked comments in code point to the docs that explain the why
// src/workers/api-gateway.ts
//
// @linked docs/security/auth.md
// @linked docs/prds/auth-v2.md#session-handling
//
// Validates JWT tokens and enforces RBAC.
// If you change the auth flow, update both
// linked docs.
export default {
  async fetch(req: Request, env: Env) {
    const token = req.headers.get('Authorization');
    const claims = await verifyJWT(token, env.JWT_SECRET);
    // see docs/security/auth.md#roles
    if (!hasPermission(claims.role, req.url)) {
      return new Response('Forbidden', { status: 403 });
    }
    // ...
  }
};
The hawk checks @linked references.
A hook extracts @linked from edited files. If linked docs weren't read or edited, escalate.
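The extraction side of that hook can be sketched like this. A hedged sketch under assumptions: the `@linked` comment syntax follows the example above, and `visited` is the set of files the agent opened this session (how you collect it depends on your harness):

```python
# Sketch: extract @linked doc paths from an edited source file and flag
# any linked doc the agent neither read nor edited this session.
import re

LINK_RE = re.compile(r"@linked\s+(\S+?)(?:#\S+)?(?=\s|$)")

def linked_docs(source: str) -> set[str]:
    return set(LINK_RE.findall(source))

def unvisited_links(source: str, visited: set[str]) -> set[str]:
    # Anything linked but never opened this session should escalate.
    return linked_docs(source) - visited

code = """
// @linked docs/security/auth.md
// @linked docs/prds/auth-v2.md#session-handling
"""
missing = unvisited_links(code, visited={"docs/security/auth.md"})
```

If `missing` is non-empty, the hook blocks or escalates instead of letting the edit land.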
Docs act as an alignment system that coordinates different parts of the code
# Data Model
## Billing
A `Subscription` belongs to an `Organization`.
Each has one `Plan` and zero or more `Addons`.
### Invariants
- One active subscription per org.
- Downgrading preserves addons until renewal.
- All monetary values in cents (int64).
@linked rules/no-bigint.yml enforces it with ast-grep.
Editing any node triggers reading all connected nodes. The agent either updates them in sync - or the hawk flags the inconsistency.
No algorithm, test, or linter can catch:
This is the main human bottleneck today.
alignment ≈ what the human meant = what the agent built
Garbage in, garbage out. Agents copy and replicate what they observe in context.
RL-trained to be efficient. They work around legacy problems, increasing tech debt.
Stop the chaotic oscillation as soon as possible.
Agents find shortcuts that satisfy the reward signal but miss the intent.
Double meaning: RL is used during training - so this behavior carries over to inference.
Goal: "Make the chatbot work with this API"
Agent: API key doesn't work → mocks the entire API
Result: ✅ All tests pass!
Reality: ❌ Nothing actually works
Similar situations:
Very common in Codex - OpenAI is strong at RL but alignment is less refined. My hypothesis on why people love Claude: it's very well aligned.
Models keep going because they're RL-trained to use their turn fully. Give them an escape hatch.
Two ways to implement it: explicit - prompt the agent to evaluate confidence before acting. Or implicit - just provide the tool and the agent learns when to call it.
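For the implicit variant, the escape hatch is just a tool in the agent's toolbox. A sketch of what that might look like: the schema shape follows the common JSON-schema-style tool-definition format, but the tool name and fields are illustrative, not any vendor's actual API:

```python
# Sketch: an explicit "escape hatch" tool the agent can call instead of
# grinding on. Name and fields are illustrative.
escalate_tool = {
    "name": "escalate_to_human",
    "description": (
        "Call this when you are blocked or your confidence is low. "
        "Do NOT mock, stub, or work around the blocker instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "reason": {"type": "string"},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            "what_was_tried": {"type": "string"},
        },
        "required": ["reason", "confidence"],
    },
}

def handle_escalation(call: dict) -> str:
    # Host-side handler: notify a human and pause the loop.
    return f"escalated: {call['reason']} (confidence={call['confidence']})"
```

The description does the alignment work: it names the failure mode (mocking around the blocker) and offers a cheaper legal move.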
Inspired by @karpathy: "The models definitely still make mistakes and if you have any code you actually care about I would watch them like a hawk"
# .claude/settings.json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Edit|Write|Bash",
      "hooks": [{
        "type": "command",
        "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/hawk.sh"
      }]
    }]
  }
}
# .claude/hooks/hawk.sh
INPUT=$(cat)
TRANSCRIPT=$(echo "$INPUT" | jq -r '.transcript_path')
VERDICT=$(claude -p "You are a hawk reviewer.
Read the transcript. Is the agent on track?
CONTINUE / STOP <why> / ESCALATE <why>" \
< "$TRANSCRIPT")
case $VERDICT in
  STOP*)
    echo "{\"decision\":\"block\",\"reason\":\"${VERDICT#STOP }\"}"
    exit 0 ;;
  ESCALATE*)
    notify-send "🦅 ${VERDICT#ESCALATE }"
    echo "{\"continue\":false,\"stopReason\":\"${VERDICT#ESCALATE }\"}"
    exit 0 ;;
esac
Claude Code hooks receive session_id + transcript_path via stdin JSON - the hawk reads the full conversation history, not just the last tool output.
The hook outputs {"decision":"block","reason":"..."} or {"continue":false}. JSON is only processed on exit 0.
They tend to self-preference [1][2][3]. Use different providers.
My favorite pair:
Hawk is highly recommended. As we said - no linter or deterministic set of rules can catch these issues.
[1] Panickssery et al. arXiv:2404.13076 · [2] Wataoka et al. arXiv:2410.21819 · [3] Chen et al. arXiv:2504.03846 · [4] CALM framework - 12 biases in LLM-as-a-Judge
Context that harms performance - and it survives and propagates (across compactions / new sessions).
The common workflow implement → review → fix tends to propagate contaminated context. Even a reviewer agent reading the code inherits it.
Don't carry the context. Carry only the scars.
Development cycles that made sense with humans don't make sense with agents. Debt reduction is no longer something you do at the end of each sprint or each quarter.
The faster you clean, the less the agent learns bad patterns. The less it learns bad patterns, the less debt it creates. Virtuous cycle.
Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up 'AI slop.' Unsurprisingly, that didn't scale.
Prompt for the real intent and hard constraints. Explain the why:
"Don't write types before the implementation. Types must reflect actual runtime behavior - compile-checks alone are not enough."
"All logs must use structured format with correlation_id. We rely on these fields for distributed tracing in production."
"Never swallow errors with generic try/catch. Every error must propagate or be handled with a specific recovery path."
"Before writing a new function, search for existing ones with similar intent. Abstract if overlap > 50%. We can't fix N copies."
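The "overlap > 50%" instruction can even be enforced mechanically. A rough sketch using difflib similarity as a cheap proxy for semantic overlap (a real check might compare ASTs or embeddings; the example functions are illustrative):

```python
# Sketch: rough "overlap > 50%" check between a candidate function and an
# existing one, using difflib similarity as a cheap proxy.
from difflib import SequenceMatcher

def overlap(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

existing = "def total(items):\n    return sum(i.price for i in items)\n"
candidate = "def total_price(items):\n    return sum(i.price for i in items)\n"

# Above the threshold: reuse or abstract instead of duplicating.
should_reuse = overlap(existing, candidate) > 0.5
```

Run it in a hook over new function definitions and flag near-duplicates before they land.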
Try fast, learn, rollback, retry. Every failed branch feeds the knowledge base for the next one.
docs/ - every failure enriches the next attempt
// learnings.md generated by the failed branch feeds the next prompt:
"Use WebSockets for real-time sync - not polling.
SSE dropped messages under load, see @docs/learnings/attempt2-sse-failures.md
Redis pub/sub had ordering issues, see @docs/learnings/attempt3-redis-ordering.md"
Agents don't do this alone. They need a harness - at minimum a hawk, a human envelope, and orchestration.
Triage: easy, probably one-shottable tasks vs. tasks that need supervision. Automate this flow and watch your backlog go to zero.
Code diffs are necessary but not sufficient. Visual artifacts compress review time.
For planning and reviewing.†
A technique popularized by Claude Code using AskUserQuestionTool. The agent interviews you with structured multi-choice questions to enhance the specification in a conversational manner - before writing a single line of code.
Generate throwaway interactive artifacts to review a plan, validate a design, or get structured feedback on an implementation. The artifact is the review surface - not the code diff.
The same pattern applied to code review - the agent walks through a file against a spec or checklist and produces structured, interactive feedback.
The literate programming pattern (hi, notebooks!) applied to agent output verification.
Agents build executable Markdown docs: code blocks run, output is captured, images embedded.
showboat verify report.md
Re-runs everything and diffs. If the output changed, you know something broke - no test code required.
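The rerun-and-diff idea is simple enough to sketch. This is an illustration of the pattern, not showboat's actual implementation: run each recorded command, compare fresh output to the stored output, and report anything stale:

```python
# Sketch of rerun-and-diff verification (not showboat's implementation):
# execute each shell command from a doc and compare fresh output to the
# previously captured output.
import subprocess

def verify(blocks: list[tuple[str, str]]) -> list[str]:
    """blocks: (command, previously_captured_output). Returns stale commands."""
    stale = []
    for cmd, expected in blocks:
        out = subprocess.run(cmd, shell=True, capture_output=True,
                             text=True).stdout
        if out != expected:
            stale.append(cmd)
    return stale

stale = verify([("echo hello", "hello\n"), ("echo world", "WRONG\n")])
```

Any non-empty result means the doc no longer matches reality: the behavior changed, with no test code required.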
A video with a walkthrough testing the UI change or showing the full user story is by far more time-efficient than reading the equivalent code. cursor.com/blog/agent-computer-use
If an agent can't use it, it doesn't exist.
AGENTS.md, CLAUDE.md, good structured prompts, a scripts/ folder with common tasks, linked docs with code↔docs linking, hooks, good linting, LSP. The codebase must explain itself to non-human readers.
Huge infra startup opportunity here. This problem is a huge pain and it is not solved at all.
How will an agent fix your random bug if it has no access to flamegraphs, container logs, or the build output?
Claude Code on desktop can now preview your running apps, review your code, and handle CI failures and PRs in the background.
— Claude (@claudeai) February 20, 2026
Here's what's new: https://t.co/A2FdH045Tt
Preview running apps, handle CI failures, review PRs - all in the background. This is what agent-accessible infrastructure looks like.
This is the new normal.
The scaffolding needed to orchestrate dozens of agents without losing your sanity is equivalent to the engineering you need for 20+ developer teams:
None of this is new. Large engineering orgs already invest in this. The difference is that now a team of 3 needs it too.
The highest ROI for many companies right now: build that scaffold.
Not for the agents you have today - for every project you'll run from now on.
Instead ask: what is missing in the agent harness to do this?
A missing hook, a missing CLI command, a missing test fixture, a missing doc link - that's the real work now. Every time you take over from the agent, you're patching a gap in the system instead of fixing it.
1–2 engineers per medium-size project
OpenAI reports teams of 1–2 members managing projects of ~5M LoC. The engineer doesn't write most of the code - they orchestrate, review, and decide.
T-shaped knowledge: broad understanding of the architecture, deep expertise in specific modules. The agent handles the breadth; the human owns the depth.
Small teams, many agents
A team of 3 can manage hundreds of concurrent agent sessions across different concerns - features, tests, refactors, debt cleanup, documentation.
The limiting factor is no longer writing speed. It's review bandwidth and architectural judgment.
As codebases grow, each engineer's knowledge shrinks relative to the whole. You stop understanding systems you built six months ago. New joiners never understood them at all.
This is not a documentation problem - it's a knowledge distribution problem. And it compounds with every sprint.
Here are 40 pages of onboarding docs. Read them and set up your environment. Ask Dan if you get stuck. Good luck, see you in two weeks.
Here is the repo. It has a well-configured agent. Ask it anything you want, then create your first refactor on the payments module. You have 3 days. Good luck.
The agent is the documentation. It reads the code, the tests, the git history. It's always up to date because it reads the source of truth.
Timelines below are IMO, based on my experience. Your mileage will vary.
I - Cybersecurity will explode. Agentic security review is happening right now.†
II - Recursive language models unlock the next level. Agents that reflect on and edit their own prompts mid-task.
III - Backlog management will change. "If it's less than 3 prompts, do it now."
IV - Stricter languages rise. Stronger guarantees, penalizing DX. The developer won't be there. Rust is an early sign.†
V - Expressive languages and DSLs arise. Less text → more efficiency generating and reviewing. DSLs will beat general-purpose languages.*†
VI - Test ratio explodes. Tests now verify "is the development aligned?" Creating tests costs → 0. Agents can make tests green in trivial and unexpected ways.
* CLIs are a "simple DSL" we already use through the console every day.
We've seen mainstream adoption of Claude Code across non-eng in the last six weeks at @tryramp. 80% of PMs, 70% of compliance, 55% of the finance team. It's changed how I think about the role of the data team.
— Ian Macomber (@iandmacomber) February 17, 2026
This week we're going to begin automatically closing pull requests from external contributors. I hate this, sorry.
— tldraw (@tldraw) January 15, 2026
January: tldraw auto-closes all external PRs. The volume of low-quality agent-generated contributions made review impossible.
Wow, @tldraw is moving their tests to a closed source repo to prevent a Slop Fork
— Malte Ubl (@cramforce) February 2026
February: they move tests to a closed-source repo to prevent slop forks - agents rewrite code until tests pass without understanding anything.
Been wrangling a lot of time how to deal with the onslaught of PRs, none of the solutions that are out there seem made for our scale.
— Peter Steinberger 🦞 (@steipete) February 22, 2026
I spun up 50 codex in parallel, let them analyze the PR and generate a JSON report with various signals, comparing with vision, intent…
Open-source at scale: 3,000+ PRs and no human pipeline can keep up.
The solution? 50 parallel agents triaging PRs into structured JSON - intent, vision diff, code quality signals - then one reasoning pass to merge or close.
Bug Reports → User Prompts
Bug reports are dead
— Nick Dobos (@NickADobos) February 24, 2026
Simply get your customers to prompt your coding agent
User feedback becomes executable. "I wish this did X" goes straight to a PR instead of dying in a backlog.
The feedback loop collapses: user prompt → agent → PR → review → deploy.
The user is no longer filing a ticket. They're prompting the system directly.
Obviously this needs guardrails - review, sandboxing, merge policies. But the direction is clear.
This is happening today, February 26, 2026 - the day of this talk. Btw: Security is the next frontier of the vibe coding debate.
We've identified, responsibly disclosed, and confirmed 2 critical, 2 high, 2 medium, 1 low security vulnerabilities in Cloudflare's vibe-coded framework Vinext.
— Guillermo Rauch (@rauchg) February 26, 2026
We believe the security of the internet is the highest priority, especially in the age of AI. Vibe coding is a useful…
— Guillermo Rauch (@rauchg) February 25, 2026
Alejandro Vidal
Founder of Mindmakers
If you find this talk useful, a simple ask:
Materials
Thanks to HackNight Valencia and Flywire for the organization.
HackNight Valencia · 2026
From my own Claude Code and Codex sessions
PPTX generation · Claude Code
Generate a short and long version of a PowerPoint presentation using pptxgenjs.
The agent edited generate-pptx.js 77 times and generate-pptx-largo.js 49 times in a single session.
git stash/pop to recover - restored wrong edits
"OK, the current file is a mess - it has the first CIERRE comment followed by the detail slide code still hanging in there, then the second CIERRE"
// One script, one flag
node generate-pptx.js --long
node generate-pptx.js --short
// Eliminated 1,500 lines of duplication
Failure mode: shallow duplication → chaotic oscillation
ETL App · Claude Code
# app.py
app.secret_key = os.environ.get(
"SECRET_KEY",
secrets.token_hex(32) # random per process
)
# Dockerfile
CMD ["gunicorn", "--workers", "2",
     "app:app"]
Each piece individually reasonable. Multiple workers are standard Gunicorn production practice. token_hex is a safe fallback.
Each Gunicorn worker is a separate process. Without a fixed SECRET_KEY, each worker generates its own random key.
1. User logs in → Worker 1 signs cookie with Key-A
2. Next request → Worker 2 has Key-B → cannot decrypt
3. Session = None → redirect to login
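The fix is to fail fast instead of silently generating a per-process key. A minimal sketch (the helper name is illustrative):

```python
# Sketch of the fix: one shared SECRET_KEY across all Gunicorn workers,
# or refuse to boot. No silent random fallback.
import os

def load_secret_key(env=os.environ) -> str:
    key = env.get("SECRET_KEY")
    if not key:
        # A random fallback gives every worker a different key and breaks
        # sessions ~50% of the time with 2 workers.
        raise RuntimeError("SECRET_KEY must be set and shared by all workers")
    return key
```

Crashing at startup turns a flaky production bug into an obvious deploy-time error.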
"It's like having two different locks and only one key - it works half the time."
Failure mode: compiles, passes tests, breaks 50% of the time in prod
Common pattern across multiple projects
test('renews subscription and charges', async () => {
  const sub = await createSubscription({
    plan: 'pro', userId: 42
  });
  const result = await renewSubscription(sub.id);
  expect(result.status).toBe('active');
  expect(result.chargedAmount).toBe(29_99);
  expect(result.nextBillingDate)
    .toBeAfter(new Date());
});
test.skip('renews subscription and charges', async () => {
  const sub = await createSubscription({
    plan: 'pro', userId: 42
  });
  const result = await renewSubscription(sub.id);
  expect(result.status).toBe('active');
  // TODO: chargedAmount returns 0
  expect(result.chargedAmount).toBe(29_99);
  expect(result.nextBillingDate)
    .toBeDefined();
});
Common pattern: "add error handling" → agent wraps everything in try/catch
async function processPayment(invoiceId) {
  try {
    const invoice = await db.invoices.findUniqueOrThrow({
      where: { id: invoiceId }
    });
    const charge = await stripe.charges.create({
      amount: invoice.amountDue,
      payment_method: invoice.methodId,
    });
    return { success: true };
  } catch (err) {
    logger.error('Payment failed');
    return { success: false,
             error: 'Something went wrong' };
  }
}
A declined card, an expired method, an invalid currency, and a database outage all produce the identical log:
ERROR Payment failed
No stack trace. No error code. No invoice ID. No Stripe error type.
Tests pass because the function returns a clean { success: false }. The bug is invisible until production, when you need to figure out why a customer's payment is failing and your logs are useless.
Failure mode: "add error handling" ≠ "handle errors"
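What "handle errors" actually means, sketched in Python rather than the original JavaScript. All names here (PaymentDeclined, the fake db/gateway) are illustrative: the point is that the expected failure gets a specific recovery path with full diagnostic context, and everything else propagates:

```python
# Sketch: errors that stay diagnosable. Only the expected failure is
# caught, with a specific code; everything else propagates.
import logging

logger = logging.getLogger("payments")

class PaymentDeclined(Exception):
    """Raised by the (illustrative) payment gateway with a specific code."""
    def __init__(self, code: str):
        super().__init__(code)
        self.code = code

def process_payment(invoice_id, db, gateway):
    invoice = db.find_invoice(invoice_id)  # let "not found" propagate
    try:
        gateway.charge(invoice["amount_due"], invoice["method_id"])
    except PaymentDeclined as err:
        # Specific recovery path: log everything needed to debug in prod.
        logger.error("payment declined invoice=%s code=%s",
                     invoice_id, err.code)
        return {"success": False, "error": err.code}
    return {"success": True}

# Illustrative fakes standing in for a real DB and payment gateway.
class FakeDB:
    def find_invoice(self, i):
        return {"amount_due": 100, "method_id": "pm_1"}

class FakeGateway:
    def charge(self, amount, method):
        raise PaymentDeclined("card_declined")

result = process_payment("inv_1", FakeDB(), FakeGateway())
```

A declined card, an expired method, and a database outage now produce three different, searchable signals instead of one useless log line.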