Working Code, Wrong Engineering: Why AI-Generated Code Needs System-Definition Tests

Real-World Examples of Drift in Perfectly Working AI-Generated Code

Traditional software engineering already depends on tests, CI/CD, dependency management, code review, lockfiles, package manifests, SBOMs, and release controls. Those mechanisms do more than check whether code works. They tie implementation to specification, dependencies to policy, artifacts to builds, and releases to traceable engineering decisions.

AI-generated code can blur those assumptions. A model can solve the local programming task while violating the system definition. A working AI-generated artifact is not automatically a correct engineering solution.

That is the silent danger: the code can run correctly and pass functional tests, yet still carry hidden system-level drift. This drift can have a major impact because it may surface too late, after the product has reached customers, when the risk is no longer theoretical and has already become real operational, legal, reputational, or economic damage.

That drift can take many forms: crossing a data boundary, skipping an audit trail, changing a failure path, calling an unauthorized service, drifting from the approved supply chain, or simply drifting away from the implicit organizational policy and engineering intent the system definition was supposed to preserve.

Generated Code and Supply-chain Drift

This supply-chain drift is not only about the traditional fear of malicious code injection. It can happen simply because the system definition failed to constrain the generated component clearly enough. The model may choose a framework, dependency, service, persistence pattern, or stack not because it is authorized for that component, but because it is common in training examples, popular in public code, or statistically likely for that kind of task. But a working solution is not always a valid solution in a specific organization, team, or system context.
A model might: import an unapproved, deprecated, experimental, or unsupported dependency; call an external SaaS service or send data to a third-party API; depend on a vendor or competitor’s platform; introduce unexpected cost, license incompatibility, legal exposure, or security risk; add operational burden; or violate the approved architectural pattern for that specific component.

Enterprise software needs supply-chain security. These drifts are not minor implementation preferences. A generated component may still run, compile, behave correctly and pass tests while introducing risks that directly affect project viability: unexpected cost, operational fragility, license incompatibility, compliance exposure, reputational damage, legal liability, or economic impact. That is why “the code runs” is not enough.

Generated Code Drift Examples

SaaS Cost Drift

A model may generate a working support-ticket classifier that calls an external SaaS API for every ticket. But the component now creates unexpected per-request cost, rate-limit exposure, vendor dependency, and data-boundary risk.

Even when the organization is allowed to use that SaaS provider, the generated code may still violate the system definition by using the wrong tier, endpoint, region, model, retention setting, or premium feature. For example, the system definition may allow the standard internal enterprise endpoint, but the generated code may call a premium API, enable extended retention, request high-cost inference options, or use a feature that is not approved for that workload. That is not a functional behavior failure. It is a missing explicit service-usage boundary in the system definition.

Operational Burden Drift

A generated component can also create operational burden. A model may solve a task by introducing a framework, runtime, message broker, database, sidecar, background worker, plugin system, or deployment pattern that was not planned for that artifact.
But the organization now has to deploy, monitor, patch, scale, secure, document, and support an extra operational surface that was never part of the approved system definition. That is not a code bug. It is missing operational validation in the system definition.

Supply-Chain License Incompatibility

A generated component can also create license drift. For example, if an organization is building a proprietary, closed-source product, the system definition may forbid GPLv3 or AGPLv3 dependencies. A model may still import a GPLv3 or AGPLv3 library because it is common, well-documented, or statistically likely for that task.

Open-source cases can be subtler. A model may import an Apache 2.0 library into a component that must remain GPLv2-only, or introduce a GPLv3 library into an artifact that must remain Apache 2.0-only.

And it is not just code. It may also pull in datasets, icons, fonts, documentation, or generated assets under Creative Commons NonCommercial or similarly restricted licenses that violate commercial-use policy. The code may compile and the tests may pass, but the resulting artifact can become legally risky, noncompliant, or unsuitable for the intended distribution model. That is not a compilation problem. It is a missing licensing and distribution target in the system definition.

Other Drift Examples

Other drifts follow the same pattern.

Generated code may also create lifecycle drift when it uses experimental APIs, preview framework features, unstable SDKs, or old deprecated functions because they are common in public documentation, tutorials, or older codebases. This may not fail immediately, but it can create reliability, maintainability, and future migration risks. The code works today, but the organization inherits tomorrow’s technical debt.
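
The license and lifecycle boundaries described above are among the easier ones to check deterministically. The following is a minimal sketch, not a definitive implementation: it assumes an SBOM or license scanner has already produced a name, an SPDX license identifier, and a lifecycle label for each dependency. The `Dependency` record, the two policy sets, and the package names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy values; a real system definition would supply these.
DENIED_LICENSES = {"GPL-3.0-only", "AGPL-3.0-only"}
FORBIDDEN_LIFECYCLES = {"experimental", "preview", "beta", "deprecated"}


@dataclass(frozen=True)
class Dependency:
    name: str
    license: str    # SPDX identifier, as reported by an SBOM or license scanner
    lifecycle: str  # e.g. "stable", "lts", "preview", "deprecated"


def find_drift(dependencies):
    """Return one human-readable finding per policy violation."""
    findings = []
    for dep in dependencies:
        if dep.license in DENIED_LICENSES:
            findings.append(f"{dep.name}: license {dep.license} is denied")
        if dep.lifecycle in FORBIDDEN_LIFECYCLES:
            findings.append(f"{dep.name}: lifecycle '{dep.lifecycle}' is forbidden")
    return findings


if __name__ == "__main__":
    # Hypothetical dependency list for a generated component.
    deps = [
        Dependency("fastparse", "MIT", "stable"),
        Dependency("chartlib", "AGPL-3.0-only", "stable"),
        Dependency("newsdk", "Apache-2.0", "preview"),
    ]
    for finding in find_drift(deps):
        print(finding)
```

A check like this only works if the denied licenses and forbidden lifecycle stages are written down as concrete values rather than left as "compatible" or "stable" in prose.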
Generated code may also create evaluated supply-chain drift when it moves away from approved, certified, or long-supported stacks that the organization depends on for compliance, support, and supply-chain assurance. In highly regulated industries and for independent software vendors (ISVs), the expected stack may not be the newest or most popular option. The system definition may explicitly require conservative, long-life-cycle stacks, such as enterprise Linux distributions, LTS runtimes, certified repositories, or vendor-supported components. The goal is to keep the component supportable, patchable, and compliant. By making that path explicit, AI-generated code can actually reduce the usual developer friction around enterprise-curated stacks, which are sometimes bypassed because the latest public GitHub version appears simpler. Instead of defaulting to that unverified public version, the system definition forces the model to generate the compatibility code, adapters, or workarounds needed to stay inside the approved stack.

Generated code may also create internal-platform drift when it moves away from the organization’s approved internal platform strategy. A model may choose a popular public library, cloud-native service, or external pattern even when the organization has an approved internal library, self-hosted platform, shared service, or strategic technology direction. This can increase operational cost, fragment architecture, weaken platform governance, and hurt product positioning when a company wants new products to promote or reuse its own technology stack. It can also expose or depend on functions that the organization intended to reserve for premium products, higher-tier offerings, or differentiated market positioning.

Generated code may also create optimization drift when a component needs to meet a specific performance target, reduce external calls, run offline, operate as a standalone module, minimize memory use, avoid unnecessary retries, or preserve latency budgets.
Even when those constraints are hard to guarantee during code generation, they must be written into the system definition so they can be tested later through performance checks, dependency checks, call-count limits, offline-mode tests, or additional CI/CD validation gates.

Generated code may also create failure-behavior drift. Timeout values, retry limits, fallback responses, hard failures, partial-result behavior, circuit breakers, and alternative responses should not be left to whatever pattern appears common in public examples. They should be explicit because failure behavior often defines whether a system is safe, reliable, costly, or compliant under real operating conditions. For example, most application runtimes are expected to fail gracefully or retry. But in a kernel, driver, safety-critical component, or low-level system function, a hard failure or panic may be the safer and more correct behavior. That difference must be explicit in the system definition.

Taken together, these examples show why generated-code drift is not just an implementation concern. These drifts are easy to miss because they do not always look like code failures. Their impact can extend beyond the generated artifact. They can affect the viability of the whole project: whether it can be operated, maintained, supported, distributed, governed, or positioned in the market. These issues are especially dangerous because teams are rarely looking for them when the code appears to work. They may remain hidden until it is too late, when the risk has already become operational, legal, reputational, or economic damage.

These are not problems that better code alone can solve. They require a stricter, more explicit, and testable system definition.

System Definition Assurance

AI-assisted software engineering needs system-definition certainty.
The question is no longer only: “Does the code work?” The question becomes: “Can we prove this generated code conforms to the system definition that authorized it?”

This issue is common across supply-chain drift, SaaS cost drift, license drift, data-boundary violations, audit-path changes, operational burden, and architectural drift. The model does not need to act maliciously. It only needs to choose an implementation path that is technically useful but invalid for the intended artifact.

This is the reason system definition matters: it captures the engineering intent that generated code must preserve, not just the behavior it must reproduce. And this is why system definition itself needs to be tested: it must capture the implicit engineering intent that humans treat as obvious while coding, the organizational common sense that is rarely written down and that AI cannot be assumed to know. That is not a functional bug. It is a system-definition failure. That is why traditional tests are no longer enough.

What System Definition Actually Means

System definition is not a rejection of traditional requirements engineering. A strong SRS already covers many functional and non-functional aspects. But AI-generated code creates new pressure. Constraints that humans once carried in team memory, architecture practice, approved stacks, security policy, deployment patterns, and supply-chain rules must now be made explicit, versioned, testable, and enforceable.

While AI coding can produce software-shaped output (code that looks correct and satisfies the immediate prompt), system definition is the cast that turns generated code into a real engineering solution. It is the architectural control surface that makes generated artifacts governable, verifiable, traceable, and safe for production. Its purpose is not merely documentation. Its purpose is to turn AI code generation from a code-production process into a true engineering process.
In practice, it focuses on four practical layers:

- Topology: components, services, workflows, users, external systems, and their relationships.
- Contracts: APIs, schemas, interfaces, preconditions, invariants, and behavioral guarantees.
- Constraints: approved stacks, libraries, security rules, data boundaries, cost limits, deployment rules, and supply-chain rules.
- Evaluation: tests, policy checks, observability, artifact hashes, and provenance records.

System definition is not a static artifact. As new components are generated, tests fail, incidents occur, policies change, or the architecture evolves, the definition itself must be updated. This ongoing refinement is necessary to keep generated code aligned with current constraints, risks, and architectural intent.

There may be many ways to represent a system definition: structured Markdown, YAML, policy-as-code, architecture documents, deployment templates, or validated specification formats. The format is not the main point. But format selection is not neutral. If the system definition is meant to be consumed by code-generation agents, verification agents, and CI/CD gates, some boundaries should be represented in more formal formats. Natural language may be enough to guide generation, but deterministic checks need structured values: allowed repositories, denied licenses, memory limits, approved endpoints, SBOM rules, architecture references, fallback paths, and policy identifiers.

The goal is not to force every part of system definition into one rigid schema. The goal is to make each boundary as formal as it needs to be for the kind of verification it must support. The important properties are that the definition must be explicit, versioned, testable, referenceable, and enforceable.

Example: System Definition Boundaries

The following example is intentionally incomplete. It is not a universal template, a full system definition, or a required file format.
It only shows how some system-definition boundaries might be written so they are less vague, more referenceable, and easier to test. The important point is that vague terms such as “approved,” “internal,” “secure,” “low cost,” or “compliant” should point to concrete policies, versions, limits, repositories, architectures, endpoints, and fallback paths wherever possible.

    component: Support-ticket classifier
    purpose: Classify support tickets by urgency and category.

    boundaries:
      runtime: Must use only the approved internal runtime defined in definitions.runtime_reference. No new runtimes or frameworks allowed.
      language_version: Must use the approved language, version, interpreter, and package manager defined in definitions.language_version_policy. No alternatives allowed without explicit approval.
      stack_platform: Must use only packages from SUSE Linux Enterprise Server 16 XYZ-Internal repositories defined in definitions.repository_policy. Public and non-certified sources are strictly forbidden.
      lifecycle_stability: Must use only stable, LTS, vendor-supported components. Experimental, preview, beta, or deprecated features are forbidden.
      internal_platform: Must use the organization’s approved internal libraries and platforms defined in definitions.internal_platform. Public alternatives are not permitted unless explicitly approved.
      architecture: Must follow the approved architecture defined in definitions.architecture_reference. No alternative persistence, communication, or integration patterns allowed without explicit approval.
      licensing_compatibility: Must conform to the artifact’s approved distribution model defined in definitions.distribution_model. Non-commercial, proprietary-only, or incompatible licenses are forbidden. Applies to direct and transitive dependencies, datasets, icons, fonts, images, and generated assets.
      assets: Must use only assets from definitions.asset_sources. Common, public, or unverified assets are forbidden.
      external_service: No external SaaS APIs allowed. Model inference must use only the approved internal enterprise AI endpoint defined in definitions.internal_model_endpoint, with at most one inference call per workflow.
      data: All ticket data must remain inside the approved region defined in definitions.data_region_policy. External transmission is forbidden.
      cost: Premium features, extended retention, or high-cost options require explicit prior approval per definitions.premium_feature_policy.
      security: Must implement approved security patterns and enforce least-privilege access per definitions.security_patterns.
      optimization: Must respect defined resource limits per definitions.resource_limits.
      audit: Must preserve an immutable audit trail for every classification per definitions.audit_policy.
      operational: Must not introduce any new database, broker, worker, sidecar, monitoring surface, runtime, or deployment pattern per definitions.allowed_operations.
      failure: Must use only the approved fallback path defined in definitions.fallback_policy. Ad-hoc retries and custom fallback logic are forbidden.
      supply_chain: Must generate a valid SBOM and comply with all policies defined in definitions.supply_chain_policy.
      traceability: Must be fully traceable to the originating prompt, system-definition version, model context, test results, SBOM, provenance records, and build artifacts.

    # Definitions are versioned authoritative references and concrete values used by the boundaries above.
    # In a production system, each definition should also include ownership, version,
    # purpose, validation rules, and links to the authoritative policy source.
    definitions:
      runtime_reference:
        name: approved_internal_runtime
        version: "2026.05"
      language_version_policy:
        language: Python
        version: "3.12"
        interpreter: CPython
        package_manager: pip
        lockfile_required: true
      repository_policy:
        platform: SUSE Linux Enterprise Server 16
        profile: XYZ-Internal
        public_sources_allowed: false
        non_certified_sources_allowed: false
      internal_platform:
        sources:
          - Internal library catalog
          - Approved platform catalog
      architecture_reference:
        name: Architecture v2026.05
      distribution_model:
        target: GPL-compatible
      asset_sources:
        allowed:
          - Approved internal asset repositories
      internal_model_endpoint:
        endpoint: Internal enterprise AI endpoint
        max_inference_calls_per_workflow: 1
      data_region_policy:
        approved_region: approved-region
        external_transmission_allowed: false
      premium_feature_policy:
        purpose: Defines which cost-impacting features require explicit prior approval.
        explicit_approval_required_for:
          - premium_features
          - extended_retention
          - high_cost_options
      resource_limits:
        max_memory: 512Mi
        max_latency: ...
        max_internal_service_calls: 3
      security_patterns:
        reference: linked security-pattern policy document
      audit_policy:
        reference: linked audit policy document
      allowed_operations:
        reference: linked allowed-operations policy document
      fallback_policy:
        purpose: Defines the only approved behavior when confidence is low, errors occur, or the model is unavailable.
        reference: linked fallback policy document
      supply_chain_policy:
        purpose: Defines required supply-chain evidence for generated artifacts.
        reference: linked supply-chain policy document

Stronger Tests Mean Boundary Tests

For AI-generated code, stronger unit tests should not only validate behavior. They should also help validate boundaries. That means checking whether generated code uses approved dependencies, respects package manifests and lockfiles, avoids unauthorized external calls, preserves audit paths, follows failure-handling rules, and stays inside the architecture defined for that component.
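
One way to picture a boundary test is a unit test that inspects the generated source instead of its behavior. The sketch below uses Python's standard `ast` module to collect top-level imports and compare them with an allowlist. The `APPROVED_MODULES` set and the `internal_ai_client` module name are hypothetical stand-ins for values a versioned system definition would provide.

```python
import ast

# Hypothetical allowlist; in practice it would come from the versioned system definition.
APPROVED_MODULES = {"json", "logging", "internal_ai_client"}


def imported_modules(source: str) -> set[str]:
    """Collect the top-level module names imported anywhere in the source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports stay in-package.
            modules.add(node.module.split(".")[0])
    return modules


def unapproved_imports(source: str) -> set[str]:
    """Return imported modules that are not on the allowlist."""
    return imported_modules(source) - APPROVED_MODULES


def test_generated_code_uses_only_approved_modules():
    generated = "import json\nfrom internal_ai_client import classify\n"
    assert unapproved_imports(generated) == set()


def test_external_client_is_flagged():
    generated = "import json\nimport requests\n"
    assert unapproved_imports(generated) == {"requests"}
```

This only sees static `import` statements. Dynamic imports, subprocess calls, and raw network access would need other checks, such as call-graph analysis or network-egress tests.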
It also means connecting tests to supply-chain evidence. SBOMs, artifact hashes, signatures, provenance records, and dependency validation should not be treated as separate paperwork after the code is already accepted. For generated code, they are part of the verification surface. The generated artifact should be traceable back to the prompt, system definition, approved dependency policy, model/tool context, build environment, test results, and review path. Otherwise, the organization only knows that the code worked in a local test. It does not know whether the code was generated, assembled, and built according to the system definition that made it valid.

These checks include contract validation, policy checks, dependency checks, SBOM validation, provenance records, artifact hashes, evaluation harnesses, and regression tests tied back to the original definition. They are not replacements for traditional tests. They are the controls that keep generated code connected to specification, approved architecture, supply-chain governance, traceability, and accountability.

CI/CD System-Definition Checks

Many organizations already scan dependencies, generate SBOMs, sign artifacts, validate provenance, and enforce standard CI/CD policy. These controls remain essential. AI-generated code adds a critical missing layer: the system definition itself must become testable.

Building on stronger unit tests, an independent verification agent in the CI/CD pipeline should go beyond “does the code compile?” or “do the unit tests pass?” It must validate the generated code against the complete system definition: approved dependencies, allowed SaaS endpoints and tiers, licensing rules, data boundaries, audit paths, cost limits, operational patterns, and architectural constraints.

But verification should not only test the generated code. It should also test the intent and the system definition itself.
In traditional development, many constraints lived in implicit human context and team memory. With AI-generated code, those assumptions evaporate. If they are not explicitly written into the system definition, the model fills the gaps, often choosing what is common instead of what is authorized for this specific system.

The pipeline should enforce two gates:

- Definition Gate: Is the system definition complete and explicit enough to constrain generation and make compliance verifiable?
- Generated-Code Gate: Does the generated code fully conform to that definition?

Generated-Code Gate as a Hybrid Verifier

A practical implementation of the Generated-Code Gate could be an independent hybrid verifier. This verifier should not be the same agent that generated the code. It should run in an isolated, read-only context with access to the exact system-definition version, the generated code, build artifacts, SBOM, provenance records, dependency data, and relevant policy references. The important point is that this verifier should not rely on LLM reasoning alone.

Deterministic Tools First

A well-designed system definition makes many boundaries directly testable through deterministic tools: dependency scanners, license checkers, SBOM validators, lockfile validation, static analysis, call-graph analysis, policy-as-code engines, base-image scanners, resource tests, and network-egress checks.

The agentic verifier then acts as the orchestration and reasoning layer. It reads the system definition boundary by boundary, triggers the appropriate deterministic checks, consumes their results, and applies contextual judgment where ordinary scanners are weak: architectural drift, failure-behavior drift, operational burden, traceability gaps, subtle cost drift, or conflicts between multiple boundaries.
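
The boundary-by-boundary loop can be sketched as a small orchestrator. This is an illustration under stated assumptions, not a reference design: the check functions, the shape of the `evidence` dictionary, and the result keys are all hypothetical. Boundaries without a deterministic check are set aside for the reasoning layer rather than silently passed.

```python
from typing import Callable, Optional

# A deterministic check returns None when the boundary holds,
# or a short description of the violation when it does not.
Check = Callable[[dict], Optional[str]]


def check_external_service(evidence: dict) -> Optional[str]:
    """Compare observed egress hosts with the hosts the definition allows."""
    hosts = set(evidence.get("egress_hosts", []))
    allowed = set(evidence.get("allowed_hosts", []))
    extra = sorted(hosts - allowed)
    return f"unauthorized egress to {', '.join(extra)}" if extra else None


def check_operational(evidence: dict) -> Optional[str]:
    """Flag any new database, broker, worker, or similar surface."""
    added = evidence.get("new_services", [])
    return f"new operational surface: {', '.join(added)}" if added else None


# Hypothetical mapping from boundary name to deterministic check.
DETERMINISTIC_CHECKS: dict[str, Check] = {
    "external_service": check_external_service,
    "operational": check_operational,
}


def verify(boundaries: list[str], evidence: dict) -> dict:
    """Walk the boundaries in order; run a deterministic check where one exists."""
    violations, needs_judgment = [], []
    for boundary in boundaries:
        check = DETERMINISTIC_CHECKS.get(boundary)
        if check is None:
            needs_judgment.append(boundary)  # left to the reasoning layer
            continue
        problem = check(evidence)
        if problem:
            violations.append({"boundary": boundary, "description": problem})
    return {
        # Reflects deterministic checks only; judgment items are reported separately.
        "deterministic_passed": not violations,
        "violations": violations,
        "needs_judgment": needs_judgment,
    }


if __name__ == "__main__":
    result = verify(
        ["external_service", "operational", "architecture"],
        {"egress_hosts": ["api.example.com"], "allowed_hosts": [], "new_services": []},
    )
    print(result)
```

Keeping the deterministic result separate from the judgment queue makes it visible which findings came from a machine-checkable rule and which still depend on interpretation.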
Structured Verification Output

Its output should be structured enough for CI/CD automation and human review:

    passed: false
    confidence: 92
    violations:
      - boundary: external_service
        description: Generated code calls an unauthorized external SaaS API.
        location: src/classifier.py:42
        rule_violated: "No external SaaS APIs allowed."
    recommendation:
      decision: BLOCK
      reason: Code must be regenerated or fixed before entering standard CI/CD.

This turns the Generated-Code Gate into a hybrid evaluation layer: deterministic where the rule is machine-checkable, agentic where interpretation is required, and traceable in both cases. Violations caused by ambiguous or missing constraints should not be patched only in code. They should be routed back to the Definition Gate so the system definition improves before regeneration.

Pre-CI/CD Flow

The recommended flow looks like this: The Definition Gate surfaces missing or ambiguous rules before generation begins. The Generated-Code Gate catches drift after the code is produced. It checks for concrete drift: unapproved dependencies, unauthorized SaaS endpoints or premium tiers, licensing conflicts, missing audit paths, unplanned runtimes or deployment patterns, and unexpected SBOM or provenance changes.

The important distinction is that broken code and definition drift are not the same failure. If the generated code is simply broken under a clear definition, it can be regenerated. But if the code exposes an ambiguous, missing, or incomplete constraint, the process should return to the system definition first.

Review does not mean a simple approval button. It means clarifying the system definition, approving a modification, or defining a valid alternative before generation runs again. Nor does review mean turning every generated-code uncertainty into a human approval queue.
As I argued in Human Approval Will Never Scale as AI Infrastructure, human review should be reserved for unclear, conflicting, or high-impact decisions. Wherever organization-wide boundaries already define the answer, review should be automated through policy checks, definition gates, CI/CD validation, or controlled system-definition updates that do not require human-in-the-loop (HITL) approval.

Every review outcome should either improve the definition or confirm that the existing definition is already sufficient. If a dependency is banned, the system definition should say so. If a SaaS tier is allowed only for some workloads, the system definition should say so. If a deployment pattern is unacceptable for this artifact, the system definition should say so. Otherwise, the same ambiguity will appear again in the next generation.

Ideally, the verification agent should be separate from the generation agent, so the same system that produced the code is not the only system judging whether the code complied. The generated code should also be checked against a specific version of the system definition, not against a vague or moving description of intent.

Only after those gates should the code move into standard CI/CD checks such as build, test, scan, package, sign, and release. At that point, the built artifact can also be checked against its SBOM, signatures, provenance records, artifact hashes, and release policy. After standard CI/CD, the built artifact should preserve the same traceability chain: generated code, system definition version, SBOM, signatures, provenance records, test results, and release approval should all point to the same authorized build.

In this model, CI/CD does more than test code after it is written.
It ensures generation was properly constrained before it happened, that the resulting code remains aligned with the authorized system definition, that ambiguity improves the definition, and that the final artifact remains traceable to the definition that allowed it.

Conclusion

AI-generated code does not only need more tests because models make mistakes. It needs stronger tests because models can produce working code from the wrong assumptions.

A generated component may satisfy the prompt while violating the system. It may pass functional tests while introducing an unauthorized dependency, changing a failure path, bypassing an audit trail, calling an external service, or drifting from the approved supply chain.

That is the real risk. Not only broken code. Working code. Wrong engineering.

The standard for AI-generated code should therefore be higher than “the tests passed.” The standard should be: “Can we prove that this artifact works, uses approved dependencies, matches its SBOM, carries valid provenance, and conforms to the system definition that authorized it?” If the answer is no, the code is not ready for production. Even if it works.

This is where coding ends and software engineering begins. The next question is bigger: if AI can generate the code, what is the engineering language that defines the system around it? That is the topic of my next article: System Definition Brings Software Engineering to AI Coding.

Related Reading

The same principle applies beyond generated-code drift: AI systems do not eliminate old engineering patterns and structures; they make them even more important.

Other related articles in my AI series:

- Agentic AI Security Needs Filtered IPO
- Human Approval Will Never Scale as AI Infrastructure
- The Only Context Rule Your AI Agents Actually Need

View original source — Hacker Noon ↗
