Building AI Features Into Existing Products
A practical playbook for integrating AI into products that already have users — from choosing the right model to production deployment patterns.

Adding AI to a product that already has users is fundamentally different from building an AI-first startup. You have existing infrastructure, established UX patterns, real user expectations, and — most importantly — things that can break. This post covers the patterns I have found reliable after shipping AI features across multiple production applications.
Choosing Between Model Providers
The model provider decision is not permanent, and it should not be treated as one. Each provider has distinct strengths that matter in production.
OpenAI (GPT-4o, GPT-4.1) remains the most battle-tested option for general-purpose text generation. The API is stable, the documentation is thorough, and the ecosystem of tools around it is mature. If you need function calling, structured JSON output, or broad multilingual support, OpenAI is a safe default.
Anthropic (Claude) excels at nuanced instruction following and long-context tasks. When your feature involves processing large documents, maintaining complex system prompts, or handling tasks where the model needs to say "I don't know" rather than hallucinate, Claude tends to perform better. The thinking/reasoning capabilities in Claude models are also strong for multi-step analytical tasks.
Google Gemini is worth considering when your feature involves multimodal input — particularly when you need to process images, video, or audio alongside text within the same request. Gemini's native multimodal architecture avoids the bolted-on feeling of vision features in text-first models. The pricing for high-throughput use cases is also competitive.
The practical answer: start with whichever provider your team knows best, but architect your system so you can switch. Provider lock-in is the real risk, not picking the "wrong" model on day one.
The API Wrapper Pattern
Every AI integration should sit behind an abstraction layer. Not because you will definitely switch providers, but because you will definitely need to add logging, caching, rate limiting, and fallback logic — and you do not want to do that in 40 different places.
```typescript
interface AIProvider {
  generateText(prompt: string, options?: GenerateOptions): Promise<AIResponse>;
  generateStream(prompt: string, options?: GenerateOptions): AsyncGenerator<string>;
  generateStructured<T>(
    prompt: string,
    schema: z.ZodSchema<T>,
    options?: GenerateOptions
  ): Promise<T>;
}

interface GenerateOptions {
  model?: string;
  temperature?: number;
  maxTokens?: number;
  systemPrompt?: string;
}

interface AIResponse {
  content: string;
  usage: { promptTokens: number; completionTokens: number };
  model: string;
  latencyMs: number;
}
```
The concrete implementation for a given provider stays thin:
```typescript
import Anthropic from "@anthropic-ai/sdk";

class AnthropicProvider implements AIProvider {
  private client: Anthropic;

  constructor(apiKey: string) {
    this.client = new Anthropic({ apiKey });
  }

  async generateText(prompt: string, options?: GenerateOptions): Promise<AIResponse> {
    const start = Date.now();
    const response = await this.client.messages.create({
      model: options?.model ?? "claude-sonnet-4-20250514",
      max_tokens: options?.maxTokens ?? 1024,
      temperature: options?.temperature ?? 0.7,
      system: options?.systemPrompt,
      messages: [{ role: "user", content: prompt }],
    });
    const textBlock = response.content.find((b) => b.type === "text");
    return {
      content: textBlock?.text ?? "",
      usage: {
        promptTokens: response.usage.input_tokens,
        completionTokens: response.usage.output_tokens,
      },
      model: response.model,
      latencyMs: Date.now() - start,
    };
  }

  // ... generateStream, generateStructured
}
```
Then a service layer handles cross-cutting concerns:
```typescript
class AIService {
  constructor(
    private provider: AIProvider,
    private cache: CacheStore,
    private logger: Logger,
    private fallbackProvider?: AIProvider
  ) {}

  async generate(prompt: string, options?: GenerateOptions): Promise<AIResponse> {
    const cacheKey = this.buildCacheKey(prompt, options);
    const cached = await this.cache.get<AIResponse>(cacheKey);
    if (cached) return cached;

    try {
      const response = await this.provider.generateText(prompt, options);
      this.logger.info("ai_generation", {
        model: response.model,
        tokens: response.usage,
        latencyMs: response.latencyMs,
      });
      await this.cache.set(cacheKey, response, { ttl: 3600 });
      return response;
    } catch (error) {
      if (this.fallbackProvider) {
        this.logger.warn("ai_fallback_triggered", { error: String(error) });
        return this.fallbackProvider.generateText(prompt, options);
      }
      throw error;
    }
  }
}
```
This pattern pays for itself within the first week. When OpenAI has an outage (and it will), you flip to the fallback provider. When you need to debug a production prompt issue, the logs are already there.
Prompt Engineering in Production
Prompts in production are not strings in your source code. They are a separate concern that needs versioning, testing, and observability.
The template system I use is straightforward:
```typescript
interface PromptTemplate {
  id: string;
  version: number;
  system: string;
  user: string;
  variables: string[];
}

const LISTING_DESCRIPTION: PromptTemplate = {
  id: "listing-description",
  version: 3,
  system: `You are a professional copywriter for a restaurant platform.
Write compelling menu item descriptions.
Rules:
- Max 2 sentences
- Mention key ingredients
- Never use the word "delicious" or "mouth-watering"
- Match the restaurant's tone: {{tone}}`,
  user: `Write a description for: {{itemName}}
Category: {{category}}
Ingredients: {{ingredients}}`,
  variables: ["tone", "itemName", "category", "ingredients"],
};

function renderPrompt(
  template: PromptTemplate,
  vars: Record<string, string>
): { system: string; user: string } {
  let system = template.system;
  let user = template.user;
  for (const key of template.variables) {
    const value = vars[key];
    // Check for undefined explicitly so an intentionally empty string still renders.
    if (value === undefined) throw new Error(`Missing variable: ${key}`);
    system = system.replaceAll(`{{${key}}}`, value);
    user = user.replaceAll(`{{${key}}}`, value);
  }
  return { system, user };
}
```
The version number matters. When you change a prompt, bump the version and log it alongside every request. When a user reports that the AI output changed, you can trace it back to the exact prompt version. Store prompt templates in a database or config file — not hardcoded — so you can update them without redeploying.
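To make version pinning concrete, here is a minimal sketch of a version-aware template store. The in-memory Map is an assumption for illustration — in production this would sit in front of the database or config file just described, and `PromptStore` and `StoredTemplate` are hypothetical names, not from any library.

```typescript
interface StoredTemplate {
  id: string;
  version: number;
  system: string;
  user: string;
}

class PromptStore {
  private templates = new Map<string, StoredTemplate[]>();

  register(template: StoredTemplate): void {
    const versions = this.templates.get(template.id) ?? [];
    versions.push(template);
    this.templates.set(template.id, versions);
  }

  // Latest version by default; pin an explicit version to reproduce old behavior
  // when investigating a "the AI output changed" report.
  get(id: string, version?: number): StoredTemplate {
    const versions = this.templates.get(id);
    if (!versions || versions.length === 0) throw new Error(`Unknown template: ${id}`);
    if (version !== undefined) {
      const match = versions.find((t) => t.version === version);
      if (!match) throw new Error(`Unknown version ${version} for ${id}`);
      return match;
    }
    return versions.reduce((a, b) => (b.version > a.version ? b : a));
  }
}
```

The returned `version` is what you log alongside every request, so the audit trail survives even after the template is updated again.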
Test your prompts like you test code. Maintain a set of input/output fixtures. When you change a prompt, run the fixtures through and manually review the diff. Automated eval is getting better, but human review of prompt changes still catches issues that metrics miss.
Streaming Responses for UX
Users will tolerate a 3-second wait for a complete response. They will not tolerate staring at a spinner for 15 seconds. Streaming solves this.
```typescript
async function* streamAIResponse(
  provider: AIProvider,
  prompt: string,
  options?: GenerateOptions
): AsyncGenerator<string> {
  const stream = provider.generateStream(prompt, options);
  for await (const chunk of stream) {
    yield chunk;
  }
}

// In your API route (Next.js example)
export async function POST(request: Request) {
  const { prompt, options } = await request.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of streamAIResponse(aiProvider, prompt, options)) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text: chunk })}\n\n`));
        }
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
      } catch (error) {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ error: "Generation failed" })}\n\n`)
        );
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```
On the client side, consume the stream and update the UI progressively. The perceived performance difference is dramatic — users see content appearing within 200-400ms instead of waiting for the full generation.
One implementation detail that matters: buffer partial words on the client. Some providers send tokens that split mid-word. Accumulate a small buffer and only render complete words to avoid visual jitter.
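A minimal sketch of that client-side buffer, assuming whitespace-delimited text (word boundaries in CJK or heavily punctuated text would need a smarter split):

```typescript
// Accumulate streamed chunks and emit only up to the last complete word,
// holding the trailing partial word back until more text arrives.
class WordBuffer {
  private buffer = "";

  // Returns text that is safe to render now.
  push(chunk: string): string {
    this.buffer += chunk;
    const lastSpace = this.buffer.lastIndexOf(" ");
    if (lastSpace === -1) return "";
    const complete = this.buffer.slice(0, lastSpace + 1);
    this.buffer = this.buffer.slice(lastSpace + 1);
    return complete;
  }

  // Call when the stream ends to render whatever remains.
  flush(): string {
    const rest = this.buffer;
    this.buffer = "";
    return rest;
  }
}
```

Feed each SSE `text` payload through `push`, append the return value to the UI, and call `flush` on `[DONE]`.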
Cost Control and Caching Strategies
AI API costs can surprise you. A feature that costs $2/day in staging can cost $2,000/day in production if you have not thought about caching.
Semantic caching is the highest-leverage optimization. If two users ask functionally identical questions, serve the cached response. You do not need a vector database for this — start with exact-match caching on normalized inputs. Hash the prompt (after injecting variables) and store the response with a TTL.
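A sketch of that exact-match key, assuming lowercase and whitespace-collapsing are safe normalizations for your domain (they are not for case-sensitive inputs like code):

```typescript
import { createHash } from "node:crypto";

// Build a cache key from the normalized prompt plus the options that change
// the output. Two requests that differ only in whitespace or casing share a key.
function cacheKeyFor(
  prompt: string,
  options?: { model?: string; temperature?: number }
): string {
  const normalized = prompt.trim().toLowerCase().replace(/\s+/g, " ");
  const payload = JSON.stringify({
    normalized,
    model: options?.model,
    temperature: options?.temperature,
  });
  return "ai:" + createHash("sha256").update(payload).digest("hex");
}
```

Note that model and temperature are part of the key: a cached response from a different model is a stale response.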
Tiered model routing saves money without degrading quality. Not every request needs your most expensive model. Route simple classification tasks to a smaller model and reserve the large model for complex generation:
```typescript
function selectModel(task: AITask): string {
  switch (task.complexity) {
    case "classification":
    case "extraction":
      return "gpt-4o-mini"; // fast, cheap
    case "generation":
      return "claude-sonnet-4-20250514"; // balanced
    case "reasoning":
      return "claude-opus-4-20250514"; // maximum quality
  }
}
```
Set hard budget limits. Most providers support usage limits at the API key level. Use them. Also implement application-level rate limiting per user — one abusive user should not burn through your monthly budget in an afternoon.
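Per-user limiting can start as simply as a fixed-window counter. This in-memory sketch is an assumption that you run a single app instance — with multiple instances you would back the counter with Redis or similar:

```typescript
// Fixed-window rate limiter keyed by user ID. Each user gets `limit` AI calls
// per `windowMs`; the window resets on the first call after it expires.
class UserRateLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(private limit: number, private windowMs: number) {}

  allow(userId: string, now: number = Date.now()): boolean {
    const entry = this.counts.get(userId);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.counts.set(userId, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

Check `allow` before every AI call and return a friendly "slow down" response instead of burning tokens.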
Track cost per feature, not just total spend. Tag every API call with the feature that triggered it. When the bill arrives, you need to know that the "auto-generate SEO descriptions" feature is 60% of your spend, not just that you spent $X total.
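Attribution can be a small accumulator fed from the same place you already log token usage. The per-million-token prices below are illustrative placeholders, not current provider pricing:

```typescript
// Assumed rates in USD per million tokens -- replace with your provider's pricing.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 };

class CostTracker {
  private byFeature = new Map<string, number>();

  // Call once per AI request, tagged with the feature that triggered it.
  record(feature: string, promptTokens: number, completionTokens: number): void {
    const cost =
      (promptTokens / 1_000_000) * PRICE_PER_MTOK.input +
      (completionTokens / 1_000_000) * PRICE_PER_MTOK.output;
    this.byFeature.set(feature, (this.byFeature.get(feature) ?? 0) + cost);
  }

  report(): Record<string, number> {
    return Object.fromEntries(this.byFeature);
  }
}
```

In practice you would flush this into your metrics system rather than hold it in memory, but the shape is the same: every call carries a feature tag.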
Graceful Degradation
AI features will go down. Provider outages happen. Rate limits get hit. Network requests time out. Your product needs to keep working.
The principle: AI features should enhance the experience, not gate it. If AI-powered search is unavailable, fall back to keyword search. If AI content generation fails, show the user a manual input form. Never put AI in a critical path with no bypass.
Practical implementation:
- Timeouts. Set aggressive timeouts on AI calls (10-15 seconds max). A slow response is worse than no response for most UX flows.
- Circuit breakers. After N consecutive failures, stop calling the provider for a cooldown period. This prevents cascading failures and avoids burning money on requests that will fail.
- Pre-generated fallbacks. For features like product descriptions or recommendations, maintain a set of template-based fallbacks that work without AI. They will not be as good, but they will be something.
- UI communication. Tell the user what happened. "AI suggestions are temporarily unavailable" is far better than a generic error or an infinite spinner.
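The circuit breaker from the list above fits in a few lines. The thresholds here are assumptions to tune per provider:

```typescript
// Opens after `maxFailures` consecutive failures; after `cooldownMs` it
// half-opens and allows a probe request through.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private maxFailures = 3, private cooldownMs = 30_000) {}

  canCall(now: number = Date.now()): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // half-open: let one request probe the provider
      this.failures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = now;
  }
}
```

Wire it into the service layer: check `canCall` before the provider call, and on `false` skip straight to the fallback path without spending a request.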
Real-World Examples
AI content generation is the most common integration point. For a marketing platform, this meant building a pipeline that takes a product brief, generates ad copy variations, scores them against brand guidelines (using a second AI call), and presents the top candidates to a human reviewer. The key insight: AI generates, humans curate. The feature that lets users edit and refine AI output is as important as the generation itself.
Computer vision for interior design requires a different architecture. Processing room photos for style analysis and furniture detection involves sending images to a vision model, parsing structured output, and matching results against a product catalog. Latency is higher, so the UX pattern shifts to asynchronous processing with push notifications rather than synchronous wait-and-display.
Intelligent search replaces traditional keyword matching with semantic understanding. For a restaurant platform, this meant indexing menu items with embeddings, so a search for "something spicy and vegetarian" returns relevant results even if those exact words do not appear in any listing. The embedding generation happens at write time (when menus are updated), not at query time — this keeps search fast regardless of AI provider latency.
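The write-time indexing described above can be sketched with a brute-force cosine-similarity search. The `upsert`/`search` API and the tiny vectors are illustrative; a real system would call an embedding API at write time and likely use a vector index once the catalog grows:

```typescript
type Embedding = number[];

function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class MenuIndex {
  private items: { id: string; embedding: Embedding }[] = [];

  // Called at write time, when a menu item is created or updated.
  upsert(id: string, embedding: Embedding): void {
    this.items = this.items.filter((item) => item.id !== id);
    this.items.push({ id, embedding });
  }

  // Query time touches only precomputed item embeddings.
  search(queryEmbedding: Embedding, topK = 5): string[] {
    return [...this.items]
      .sort(
        (a, b) =>
          cosineSimilarity(b.embedding, queryEmbedding) -
          cosineSimilarity(a.embedding, queryEmbedding)
      )
      .slice(0, topK)
      .map((item) => item.id);
  }
}
```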
In each case, the same principles apply: wrap the provider, version the prompts, cache aggressively, and always have a fallback.
Shipping AI Features Responsibly
The gap between an AI demo and a production AI feature is enormous. Demos do not need caching, error handling, cost controls, or graceful degradation. Production does. The patterns in this post are not theoretical — they come from shipping AI features that real users depend on daily.
From AI-powered room redesign to automated content studios, I have shipped AI features across mobile apps, SaaS platforms, and backend systems. The technology is genuinely powerful, but the engineering discipline around it is what determines whether users love the feature or learn to avoid it.
Start with the wrapper pattern, add observability from day one, cache everything that can be cached, and always give users a path forward when the AI is not available. The models will keep getting better. Your job is to make sure the integration is solid enough to take advantage of that.
Danil Ulmashev
Full Stack Developer