Kartik Patel
In Progress · Confluence · Azure DevOps · GitHub · LLMs · Measurement

AI SDLC Enablement Playbook

Case study. In progress. Internal AI developer platform.

Context

I am running an internal AI developer platform for an engineering org of roughly 25 engineers. The goal is simple to describe and slow in practice: give engineers a place to ask grounded questions about our codebase, runbooks, ADRs, and operational history - and reduce the time they spend reconstructing context that already exists somewhere in writing.

This page is a working case study. It will keep changing as the project does.

The problem before the platform

The bottleneck on most engineering work is not typing. It is context: where is the relevant ADR, which runbook covers this service, who owns this module, what did we decide last time, what changed in the last release. Engineers were finding that context manually - searching Confluence, scrolling Azure DevOps, asking each other on Slack, reading the wrong document, eventually asking the right person.

Most "AI for developers" rollouts I see in peer orgs treat this as a code-completion problem. That is part of it, but it is not the biggest part. The bigger part is grounded retrieval against the org's actual documentation and history.

What the platform connects

The platform pulls from four data sources:

  • Confluence: engineering documentation, runbooks, ADRs, postmortems
  • Azure DevOps: work items, sprints, status, recent changes
  • GitHub: code, commit history, PR discussions
  • Internal databases: operational state where the answer to "what is running where" lives

The retrieval layer is grounded in these sources. Citations are mandatory: every answer points back to a document, page, or file the engineer can open and verify.
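The citation rule can be enforced mechanically at the response layer. Here is a minimal sketch of that idea; the class and function names are hypothetical, not the platform's actual code:

```python
from dataclasses import dataclass


@dataclass
class Citation:
    source: str  # e.g. "confluence", "github", "azure_devops", "internal_db"
    title: str
    url: str


@dataclass
class GroundedAnswer:
    text: str
    citations: list  # list[Citation]


def enforce_citations(answer: GroundedAnswer) -> GroundedAnswer:
    # Refuse to return an answer the engineer cannot open and verify.
    if not answer.citations:
        raise ValueError("answer has no citations; refusing to return it")
    return answer
```

The point of the gate is that an uncitable answer is dropped before it reaches an engineer, rather than shipped with a caveat.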

What workflows it supports today

  • Asking grounded questions about services, runbooks, and ownership
  • Pulling ADR context and historical decisions without searching by memory
  • First-pass triage of support overflow tickets: the highest-ROI workflow in my experience, because the manual baseline is bad
  • Onboarding context for engineers ramping on unfamiliar services

The platform is not a coding agent. It is a context layer.

What I am measuring

  • Cycle time, by work type: bug fix, small feature, larger feature, infrastructure, on-call. AI tooling helps with some of these much more than others. Aggregating washes out the signal.
  • Quality at the boundary: defects found in code review and post-deploy issues, normalized for code change volume. Faster should not come with a quality cost.
  • Support ticket triage time: the workflow with the biggest delta in my experience.
  • Context-gathering time: proxied through how often engineers search wikis, ping each other for runbook info, or end up reading the wrong document.
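The first metric is the one most easily botched by aggregation, so it is worth showing the shape of the computation. A sketch, with invented sample data rather than real measurements:

```python
from collections import defaultdict
from statistics import median


def cycle_time_by_work_type(items):
    """items: iterable of (work_type, cycle_time_days) pairs.

    Returns per-type medians. A single org-wide aggregate would
    wash out the signal, since AI tooling helps some work types
    far more than others.
    """
    buckets = defaultdict(list)
    for work_type, days in items:
        buckets[work_type].append(days)
    return {wt: median(ds) for wt, ds in sorted(buckets.items())}
```

Quality at the boundary gets the same treatment: defects per unit of change volume, split by work type, so a speedup in one bucket cannot hide a quality regression in another.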

What I am deliberately not measuring

  • Tool adoption: prompts per engineer per week, percentage of code with AI involvement, Copilot acceptance rates. Adoption tells me whether engineers opened the tool. It does not tell me whether their work got better. Many things get adopted and quietly stop being useful, and the adoption number does not catch it.
  • Self-reported productivity: engineers reliably report that AI tools save them time. That is worth knowing, but it is not a measurement. The correlation between self-reported time savings and actual cycle time is weaker than you would hope.

What surprised me

The platform has to earn the second use. It is easy to get engineers to try a new tool once. Getting them to use it on their actual third task of the day is the bar. If answers are wrong, partial, or unclear about sources, engineers stop trusting the tool - and once trust goes, you do not get it back from that engineer for a long time.

Citations matter more than answer quality. A roughly correct answer with sources is more useful than a more correct answer without them, because engineers can verify and adjust. This shaped the retrieval and response layer.

The hard part is the data, not the model. The biggest factor in whether the platform is useful is whether the underlying documentation, ADRs, runbooks, and code comments are good. AI tooling makes good documentation more valuable and bad documentation more visible. I have come to think of this as a forcing function for documentation discipline - a side effect I did not expect.

ProServ as sounding board

This lesson comes from modernization work rather than the AI platform itself, but it applies here too: outside expertise is useful as a second opinion, not as a substitute for internal ownership. AWS ProServ was valuable in the GovCloud work because they had recent platform-specific judgment we did not. The same pattern holds for AI tooling vendors and consultants. They can pressure-test architecture, retrieval strategy, evaluation methods, and security posture. They cannot know whether the workflow will survive contact with my engineers on a normal Tuesday.

The ownership has to stay inside the team. If the people who live with the system do not understand how answers are grounded, how sources are refreshed, how failures are handled, and what the tool should refuse to answer, the rollout is fragile from day one.

Customer cutover is engineering work

Internal AI platforms have cutovers too. They are just quieter than cloud migrations. A team moves from asking the senior engineer in Slack to asking the platform. Support triage moves from manual context gathering to a generated first pass with citations. Onboarding moves from "read these five docs" to "ask grounded questions, then verify the source."

That transition needs engineering discipline: clear scope, source freshness rules, access boundaries, feedback capture, rollback paths, and explicit guidance on when not to trust the answer. Treating rollout as a comms exercise misses the point. The adoption path is part of the system.

What I am not ready to claim yet

I would want at least two quarters of data before I would put a multiplier number on this work. Most published AI productivity numbers were measured too early, in conditions that do not generalize. I would rather under-claim now and have a defensible number later than overclaim and walk it back.