Here's What an AI Agent Found When We Let It Grade All Our Repos

When you roll out AI coding assistants to an engineering team, the tooling conversation centers on prompts, models, and context windows. What gets almost zero attention is whether each repository is configured to give the AI useful context. At Promova, we use Claude Code across our engineering org, on everything from mobile apps to billing services. A few months in, we noticed that the same tool produced wildly different results depending on which repo you were in. In one service, Claude understood our conventions, knew the build commands, and suggested code that matched our patterns. In another, it was guessing. The difference was whether someone had written a decent CLAUDE.md.

We had no visibility into this: no one knew which repos had solid AI configuration and which had a blank file someone created "just to have something." Auditing manually across dozens of repos wasn't realistic.

A deeper reason we care about this is that we've been building toward an LLM-first approach to development, where AI agents can work effectively across any codebase in our org. For that to work, every repo needs to be machine-readable in a consistent way: conventions documented, build steps explicit, context structured. Our internal knowledge platform indexes all of this. When the context is there, the agent can build on accumulated learnings and avoid hallucinating conventions that don't exist. When it's missing, you're back to guessing. CLAUDE.md is the first file in that standard, and ADRs, conventions, and other structured context come next. Fleet Health is how we enforce the baseline.

So we built it to make the invisible visible: which repos are ready for LLM-first development, and which aren't.

Fleet Health Workflow

Fleet Health is a periodic workflow that crawls all repositories in our GitHub org, reads five paths in each, and produces a score from 0 to 100 (grade A through F).
It runs on a Temporal schedule, sharded across days of the week so we're not hammering GitHub on one day. Each repo gets checked roughly once per week, with manual re-runs available from the admin UI when needed.

The output is a repo-by-repo grade table with drill-down per check, plus an LLM-generated improvement report for anything below grade A.

The Five Checks

We settled on six signals after a few iterations. They map to what actually makes a difference when an AI agent works in a repo. Note the structure: five file paths are fetched from GitHub, but CLAUDE.md produces two checks (presence and quality), giving six in total.

CLAUDE.md present (weight: 20). Binary. The file exists, or it doesn't. No file means the AI gets zero context about the project.

CLAUDE.md quality (weight: 30). This is the only check that uses an LLM. We pass the file to Claude Haiku with a structured rubric: Does it contain build commands? Test commands? Coding conventions? Is it not obviously a placeholder? The model returns pass, warn, or fail, and nothing else. Temperature 0, max 256 tokens. One thing we had to build explicitly: the file content is wrapped in delimiter markers, and the model is told it's untrusted input. CLAUDE.md files get edited by many people and could contain anything, including accidental prompt injection patterns. The instruction boundary is explicit, not implicit.

.claude/settings.json present (weight: 10). Where permissions and tool access live.

.claude/hooks/ directory present (weight: 15). Hooks are how you run pre/post tool execution logic: lint, test, safety checks. Their absence doesn't break anything, but it's a signal that the team hasn't thought about the AI's operating environment.

.claude/rules/ directory present (weight: 10). Project-level rules for AI behavior: conventions, constraints, do-not-touch patterns. Separate from hooks because they're declarative rather than executable.

.github/workflows/ present (weight: 15). Proxy for CI.
Repos without CI tend to have lower overall engineering hygiene, which correlates with worse AI collaboration outcomes.

Scoring is weighted: pass = full weight, warn = 50%, fail = 0%, NA = excluded from the denominator. Weights sum to 100. A repo that is missing CLAUDE.md but has solid CI might score around 30. One with a quality CLAUDE.md and everything else present hits 90+.

The Architecture

Three activities, two worker queues. We have two worker types: general workers and a dedicated LLM worker. The fetch activity runs on the general queue: it's pure HTTP calls to GitHub, no LLM involved. The evaluation and report generation run on the LLM worker, which is the only one that holds the Anthropic API key, so any activity that calls an LLM gets explicitly routed there. This keeps the LLM API surface minimal and makes cost attribution straightforward.

Sharding is simple: hash(github repo id) % 7. Each repo gets assigned to a day of the week and only gets checked on that day. The daily run processes ~1/7 of the org, not everything at once.

The fetch step reads exactly five paths. We deliberately kept this narrow: the goal is AI configuration health, not a general repo audit. More signals would mean more noise and more false grades. Five paths produce six checks because CLAUDE.md gets evaluated twice: once for presence, once for content quality.

The Grading Scale

A 90–100: Everything configured, CLAUDE.md is substantive.
B 75–89: Almost everything present, minor gaps.
C 60–74: CLAUDE.md exists but is weak, or something is missing.
D 45–59: No CLAUDE.md but has CI or hooks.
F 0–44: Nothing configured.
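
The scoring rule, grading scale, and day-of-week sharding described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code. The check key names, the `score`, `grade`, and `shard_day` helpers, and the choice of SHA-256 as the hash are all assumptions made for the example. It also assumes the score is earned points divided by possible points, scaled to 100, and that fractional scores fall into the band whose lower bound they reach.

```python
import hashlib
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"
    NA = "na"


# Weights as listed in the article (they sum to 100).
# The key names are hypothetical, chosen for this sketch.
WEIGHTS = {
    "claude_md_present": 20,
    "claude_md_quality": 30,
    "settings_json": 10,
    "hooks_dir": 15,
    "rules_dir": 10,
    "workflows_dir": 15,
}

# pass = full weight, warn = 50%, fail = 0%.
CREDIT = {Verdict.PASS: 1.0, Verdict.WARN: 0.5, Verdict.FAIL: 0.0}


def score(verdicts: dict[str, Verdict]) -> float:
    """Return a 0-100 score; NA checks are dropped from the denominator."""
    earned = 0.0
    possible = 0
    for check, weight in WEIGHTS.items():
        verdict = verdicts[check]
        if verdict is Verdict.NA:
            continue  # excluded from both numerator and denominator
        possible += weight
        earned += weight * CREDIT[verdict]
    if possible == 0:
        return 0.0  # assumption: a repo with only NA verdicts scores 0
    return 100.0 * earned / possible


def grade(points: float) -> str:
    """Map a score to a letter using the lower bound of each band."""
    for floor, letter in ((90, "A"), (75, "B"), (60, "C"), (45, "D")):
        if points >= floor:
            return letter
    return "F"


def shard_day(repo_id: int) -> int:
    """Assign a repo to one of 7 daily shards using a stable hash.

    SHA-256 is an assumption; the article only says hash(repo id) % 7.
    A stable hash is used so the shard does not change between processes.
    """
    digest = hashlib.sha256(str(repo_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 7
```

Under these assumptions, a repo with a placeholder CLAUDE.md (quality check fails) and everything else present scores 70, a C. A repo with no CLAUDE.md and only CI scores 15 out of a possible 70, about 21.4, because the NA quality check shrinks the denominator.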
After the first org-wide sweep, the distribution looked roughly like what you'd expect when adoption is organic and untracked: a cluster at the top (repos maintained by engineers who take AI tooling seriously), a long tail in the D–F range (repos that had never been touched), and a messy middle of C-grades that turned out to be the most interesting category.

The C-grade repos usually had a CLAUDE.md, but it was either a one-liner placeholder or three-year-old instructions that referenced a build system we'd moved away from. The LLM judge caught these: a file that technically exists but says "TODO: fill this in" gets a fail on the quality check, which pulls the score below 75.

The "How to Fix" Report

For any repo below grade A, the system generates a markdown improvement report: what's missing, what's weak, and specific instructions for bringing the score up. This runs as a separate activity after scoring. The prompt is intentionally opinionated: failing checks are listed in descending weight order (so the most impactful fix comes first), and the model is told to write ~500 words of practical guidance, not generic advice.

The output gets stored in MongoDB and surfaced in the admin UI on the repo's health page. In practice, this report is what engineers actually read. The grade gives you a quick signal; the report tells you what to do about it.

What We Did With the Data

The fleet view became a prioritization tool. After the initial sweep, we identified repos with active development but low health scores. These were the highest-leverage targets, because someone was using AI there already but without proper context. We started with repos in the C–D range that had recent commit activity.

The improvement cycle was fast: the engineer reads the report, adds build commands and conventions to CLAUDE.md, re-runs the health check, and the score jumps. Most repos moved from D to B in under an hour of work.

The F-grade repos were almost all dormant or archived.
We treated those differently: a low health score on a repo nobody touches is noise, not signal.

What We'd Do Differently

The LLM quality check is the right idea, but it needs calibration. We initially set the pass threshold at "3 of 4 criteria met" and found it too generous: files with build commands but zero conventions were passing. We tightened it, but it's still an approximation. A rule-based check (minimum word count, presence of specific sections) might be more reproducible.

Sharding by day creates a visibility lag. If you fix a repo on Monday but it's scheduled for Thursday, you won't see the updated score until Thursday's run unless you trigger a manual one. For active work, this is friction. We added the manual re-run button early; it gets used a lot.

The "NA" verdict on CLAUDE.md quality needs to be visible. If CLAUDE.md is absent, the quality check returns NA and gets excluded from scoring. This means a repo with very little configured can still score higher than expected, because the possible_points denominator shrinks. We added explicit labeling in the UI to show when NA verdicts are affecting the score, but it took a few confused engineers before we got there.

The Bigger Picture

Fleet Health is one part of a larger push: making AI-assisted development consistent across the engineering org, not just good in the repos where someone happened to invest in setup. The insight that drove this was simple: AI tool adoption is visible (you can see who's using Claude Code), but AI tool effectiveness is invisible without instrumentation. A developer hitting walls because their repo has no context will quietly stop using the tool or attribute the problems to the model. Fleet Health makes the configuration gap visible and actionable.

Grade distribution across the org is now a metric we track.
It's not a perfect proxy for AI effectiveness, but it correlates with something real: teams that maintain A-grade repos tend to report fewer "the AI doesn't understand our codebase" complaints.

To be clear about scope: this is a beta. We're not saying this is the right rubric, or that five files are the right signal set, or that an A grade means the repo is well set up for AI-assisted development in any deep sense.

What we're saying is that without something like this, the onboarding cost for LLMs across codebases falls entirely on individual engineers, and it accumulates silently. This system makes that cost visible and gives teams a concrete starting point.

CLAUDE.md is just the first file in what we intend to be a broader standard: ADRs, team conventions, dependency context, and architectural decisions, all structured for machine consumption alongside human readability. Fleet Health's job right now is to ensure the baseline exists. The rest comes after.

View original source — Hacker Noon ↗

Technology · Jun 24, 2026 · 1 min
