infrastructureDecember 7, 202513 min read

CI/CD Pipeline Setup That Actually Works in Production

Battle-tested CI/CD patterns for real projects — from GitHub Actions workflows to deployment strategies that do not break production on Friday.

cicdgithub-actionsdevops

CI/CD Pipeline Setup That Actually Works in Production

The first CI/CD pipeline I set up for a production project was a single GitHub Actions workflow that ran npm test and then deployed to a VPS via SSH. It worked until it did not — a failed deployment left the server in a half-updated state on a Friday evening, and I spent the weekend manually reverting files. That experience taught me that a deployment pipeline is not just "run tests then deploy." It is the entire system of checks, gates, and rollback mechanisms that stand between a git push and your users seeing new code.

This post covers the pipeline architecture I have refined across multiple production projects, with concrete GitHub Actions examples you can adapt.

Pipeline Architecture

A production pipeline has distinct stages, and each stage has a specific purpose. Skipping stages saves minutes now and costs hours later.

The Stages

Code Push
  │
  ├─→ Stage 1: Validation (lint, format, type-check)
  │
  ├─→ Stage 2: Testing (unit, integration)
  │
  ├─→ Stage 3: Build (compile, bundle, containerize)
  │
  ├─→ Stage 4: Deploy to Staging
  │
  ├─→ Stage 5: Smoke Tests / E2E on Staging
  │
  └─→ Stage 6: Deploy to Production

Each stage acts as a gate. If validation fails, tests never run. If tests fail, the build never starts. This saves compute time and provides fast feedback — developers know within 30 seconds if they forgot to run the linter, rather than waiting 8 minutes for a test suite to fail.

Trigger Strategy

Not every push needs the full pipeline. Here is the trigger configuration I use:

# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]
    types: [opened, synchronize, reopened]

Pull requests run stages 1-3 (validate, test, build). Merges to develop deploy to staging. Merges to main deploy to production. This keeps PR feedback fast while ensuring deployments only happen from protected branches.

Stage 1: Validation

Validation catches formatting inconsistencies and type errors before tests run. These checks are fast (under 30 seconds) and catch the most common issues.

jobs:
  validate:
    name: Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Check formatting
        run: npx prettier --check "src/**/*.{ts,tsx,js,jsx,json,css}"

      - name: Lint
        run: npx eslint src/ --max-warnings 0

      - name: Type check
        run: npx tsc --noEmit

The `--max-warnings 0` Rule

ESLint distinguishes between errors (which fail the process) and warnings (which do not). Without --max-warnings 0, teams accumulate hundreds of warnings that everyone ignores. Treating warnings as errors in CI forces the team to either fix them or explicitly disable the rule. No middle ground.

Formatting as a CI Check, Not Just a Suggestion

Running Prettier in CI (with --check, not --write) ensures consistent formatting without relying on every developer having the right editor extension. If formatting fails in CI, the developer runs npx prettier --write . locally and commits the fix. This is non-negotiable — formatting debates end when a tool makes the decisions.

Stage 2: Testing

Testing is the backbone of the pipeline. I split tests into parallel jobs based on type for faster feedback.

  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run unit tests
        run: npx vitest run --coverage --reporter=verbose

      - name: Upload coverage report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/

  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: validate
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run database migrations
        run: npx prisma migrate deploy
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/testdb

      - name: Run integration tests
        run: npx vitest run --config vitest.integration.config.ts
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379

Service Containers for Integration Tests

GitHub Actions service containers are underused. Instead of mocking your database in integration tests (which tests the mock, not your code), spin up a real PostgreSQL instance. The services block handles lifecycle management — the container starts before your tests and stops after.

This adds about 15-20 seconds to the job for container startup, but the confidence you get from testing against a real database is worth it.

Parallel Test Execution

Unit tests and integration tests run in parallel (both needs: validate, not needs: unit-tests). This reduces total pipeline time. If your unit tests take 2 minutes and integration tests take 4 minutes, parallel execution means you wait 4 minutes instead of 6.

Stage 3: Build

The build stage validates that the project compiles and produces deployable artifacts.

  build:
    name: Build
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build application
        run: npm run build
        env:
          NODE_ENV: production

      - name: Upload build artifacts
        uses: actions/upload-artifact@v4
        with:
          name: build-output
          path: dist/
          retention-days: 7

Build Caching

For Docker-based deployments, layer caching dramatically speeds up builds:

  build-docker:
    name: Build Docker Image
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}
          tags: |
            ghcr.io/${{ github.repository }}:${{ github.sha }}
            ghcr.io/${{ github.repository }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

The cache-from: type=gha uses GitHub Actions cache to store Docker layers between runs. For a typical Node.js application, this reduces build time from 3-4 minutes to 30-60 seconds for dependency-only changes.

Dockerfile Best Practices for CI

The Dockerfile structure directly impacts build cache efficiency. Order layers from least-frequently changed to most-frequently changed:

FROM node:20-alpine AS base

# System dependencies (rarely changes)
RUN apk add --no-cache libc6-compat

# Package manifests (changes when dependencies change)
WORKDIR /app
COPY package.json package-lock.json ./

# Install dependencies (cached unless manifests change)
FROM base AS deps
RUN npm ci --production

FROM base AS build
RUN npm ci
COPY . .
RUN npm run build

# Production image
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./

EXPOSE 3000
CMD ["node", "dist/server.js"]

Multi-stage builds keep the final image small (no dev dependencies, no source code), and the layer ordering ensures npm ci only runs when package.json or package-lock.json changes.

Stage 4: Deployment to Staging

Staging deployment happens automatically on merges to the develop branch. The staging environment should mirror production as closely as possible — same infrastructure, same environment variables (with different values), same scaling configuration.

  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/develop' && github.event_name == 'push'
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Download build artifacts
        uses: actions/download-artifact@v4
        with:
          name: build-output
          path: dist/

      - name: Deploy to Cloud Run (Staging)
        uses: google-github-actions/deploy-cloudrun@v2
        with:
          service: my-api-staging
          region: us-central1
          image: ghcr.io/${{ github.repository }}:${{ github.sha }}
          env_vars: |
            NODE_ENV=staging
            DATABASE_URL=${{ secrets.STAGING_DATABASE_URL }}

GitHub Environments

The environment key in the workflow enables GitHub's environment protection rules. You can require manual approval, restrict which branches can deploy, and set environment-specific secrets. For staging, I typically require no approval (auto-deploy). For production, I require at least one reviewer.

Stage 5: Smoke Tests

After deploying to staging, run a basic set of tests against the deployed application to verify it actually works in the real environment.

  smoke-tests:
    name: Smoke Tests
    runs-on: ubuntu-latest
    needs: deploy-staging
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Wait for deployment
        run: |
          for i in {1..30}; do
            status=$(curl -s -o /dev/null -w "%{http_code}" https://staging.example.com/health)
            if [ "$status" = "200" ]; then
              echo "Service is healthy"
              exit 0
            fi
            echo "Waiting for service... (attempt $i)"
            sleep 10
          done
          echo "Service did not become healthy"
          exit 1

      - name: Run smoke tests
        run: npx playwright test --config=playwright.smoke.config.ts
        env:
          BASE_URL: https://staging.example.com

Smoke tests are not full E2E tests. They verify critical paths: can the homepage load, can a user log in, does the main API endpoint return data. Five to ten scenarios that take under 2 minutes. If smoke tests fail, the staging deployment is rolled back and the production deployment does not proceed.

Stage 6: Production Deployment

Production deployment is where the deployment strategy matters most. There are several approaches, each with different trade-offs.

Rolling Deployment

The simplest strategy. New instances are brought up while old instances are drained. At any point during the deployment, some requests hit the old version and some hit the new version. This is the default for most container platforms.

Pros: Simple, no extra infrastructure cost. Cons: Two versions serve traffic simultaneously, which can cause issues with database schema changes or API contract changes.

Blue-Green Deployment

Two identical environments (blue and green) exist. One serves traffic while the other is idle. Deploy to the idle environment, verify it works, then switch the router. If something goes wrong, switch back.

  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: smoke-tests
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment:
      name: production
      url: https://api.example.com
    steps:
      - name: Deploy new revision
        uses: google-github-actions/deploy-cloudrun@v2
        with:
          service: my-api-production
          region: us-central1
          image: ghcr.io/${{ github.repository }}:${{ github.sha }}
          flags: '--no-traffic'

      - name: Run production health check
        run: |
          REVISION_URL=$(gcloud run revisions describe my-api-production-${{ github.sha }} \
            --region us-central1 --format='value(status.url)')
          status=$(curl -s -o /dev/null -w "%{http_code}" "$REVISION_URL/health")
          if [ "$status" != "200" ]; then
            echo "Health check failed"
            exit 1
          fi

      - name: Migrate traffic
        run: |
          gcloud run services update-traffic my-api-production \
            --region us-central1 \
            --to-latest

The --no-traffic flag deploys the new revision without sending it any traffic. After the health check passes, traffic is shifted to the new revision. If the health check fails, the workflow stops and the old revision continues serving.

Canary Deployment

Route a small percentage of traffic (5-10%) to the new version while monitoring error rates, latency, and key business metrics. If metrics look good after a defined period, gradually increase traffic. If metrics degrade, roll back.

      - name: Canary - route 10% of traffic
        run: |
          gcloud run services update-traffic my-api-production \
            --region us-central1 \
            --to-revisions=my-api-production-${{ github.sha }}=10

      - name: Monitor canary (5 minutes)
        run: |
          sleep 300
          # Check error rate for the canary revision
          ERROR_RATE=$(curl -s "$MONITORING_API/error-rate?revision=${{ github.sha }}")
          if (( $(echo "$ERROR_RATE > 1.0" | bc -l) )); then
            echo "Error rate too high: $ERROR_RATE%. Rolling back."
            gcloud run services update-traffic my-api-production \
              --region us-central1 \
              --to-latest
            exit 1
          fi

      - name: Promote canary to 100%
        run: |
          gcloud run services update-traffic my-api-production \
            --region us-central1 \
            --to-latest

When to use canary: When you have enough traffic that 10% still generates statistically meaningful error rates. For a service handling 100 requests/minute, 10% gives you 10 requests/minute — enough to detect elevated error rates within a few minutes. For a service handling 10 requests/minute, canary deployments are not meaningful.

Rollback Mechanisms

Every deployment must have a documented rollback path. "Re-deploy the old version" is a rollback strategy, but it is slow. Better options:

Instant Rollback via Traffic Shifting

If you deploy with --no-traffic and shift traffic, rollback is a single command:

# Roll back to the previous revision
gcloud run services update-traffic my-api-production \
  --region us-central1 \
  --to-revisions=PREVIOUS_REVISION=100

This takes effect in seconds because the old revision is still running.

Automated Rollback

Add a post-deployment monitoring step that automatically rolls back if error rates spike:

      - name: Post-deploy monitor
        run: |
          for i in {1..10}; do
            sleep 60
            ERROR_RATE=$(curl -s "$MONITORING_API/error-rate?window=5m")
            if (( $(echo "$ERROR_RATE > 2.0" | bc -l) )); then
              echo "Error rate $ERROR_RATE% exceeds threshold. Rolling back."
              gcloud run services update-traffic my-api-production \
                --region us-central1 \
                --to-revisions=$PREVIOUS_REVISION=100
              exit 1
            fi
            echo "Minute $i: Error rate $ERROR_RATE% — OK"
          done

Database Migration Rollback

This is the hardest part. If your deployment includes database schema changes, rolling back the application without rolling back the database creates a mismatch. The solution is expand-and-contract migrations:

Expand: Add new columns/tables without removing old ones. Make new columns nullable or with defaults.
Deploy: New code writes to both old and new columns. Reads from new columns with fallback to old.
Migrate data: Backfill new columns from old data.
Contract: Once verified, deploy code that only uses new columns. Then drop old columns.

This makes every step independently reversible.

Secrets Management

Never hardcode secrets. Never commit .env files. Here is how I handle secrets in GitHub Actions:

GitHub Secrets

For most cases, GitHub's built-in secrets are sufficient. They are encrypted, never exposed in logs, and scoped to repositories or organizations.

env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
  API_KEY: ${{ secrets.API_KEY }}

Environment-Scoped Secrets

Different secrets for staging and production:

deploy-staging:
  environment: staging
  # ${{ secrets.DATABASE_URL }} resolves to the staging value

deploy-production:
  environment: production
  # ${{ secrets.DATABASE_URL }} resolves to the production value

External Secret Managers

For larger teams or stricter compliance requirements, use AWS Secrets Manager, Google Secret Manager, or HashiCorp Vault. The application fetches secrets at runtime rather than receiving them as environment variables.

import { SecretManagerServiceClient } from '@google-cloud/secret-manager';

const client = new SecretManagerServiceClient();

async function getSecret(name: string): Promise<string> {
  const [version] = await client.accessSecretVersion({
    name: `projects/my-project/secrets/${name}/versions/latest`,
  });
  return version.payload?.data?.toString() || '';
}

Monitoring Deployments

A deployment is not done when the code is live. It is done when you have confirmed the code is working.

Deployment Notifications

Send deployment events to Slack with relevant context:

      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "${{ job.status == 'success' && 'Deployed' || 'Failed to deploy' }} `${{ github.sha }}` to production",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*${{ job.status == 'success' && ':white_check_mark: Deployment Successful' || ':x: Deployment Failed' }}*\nCommit: `${{ github.sha }}`\nAuthor: ${{ github.actor }}\nMessage: ${{ github.event.head_commit.message }}"
                  }
                }
              ]
            }

Post-Deployment Health Checks

After every production deployment, verify critical endpoints respond correctly:

      - name: Verify production health
        run: |
          endpoints=("/" "/api/health" "/api/v1/status")
          for endpoint in "${endpoints[@]}"; do
            status=$(curl -s -o /dev/null -w "%{http_code}" "https://api.example.com$endpoint")
            if [ "$status" != "200" ]; then
              echo "FAILED: $endpoint returned $status"
              exit 1
            fi
            echo "OK: $endpoint returned $status"
          done

A Complete Pipeline Example

Putting it all together, here is a condensed but complete pipeline for a Node.js application deployed to Cloud Run:

name: CI/CD Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'npm' }
      - run: npm ci
      - run: npx prettier --check "src/**/*.{ts,tsx}"
      - run: npx eslint src/ --max-warnings 0
      - run: npx tsc --noEmit

  test:
    needs: validate
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env: { POSTGRES_USER: test, POSTGRES_PASSWORD: test, POSTGRES_DB: testdb }
        ports: ['5432:5432']
        options: --health-cmd pg_isready --health-interval 10s --health-timeout 5s --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'npm' }
      - run: npm ci
      - run: npx vitest run --coverage
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/testdb

  build-and-push:
    needs: test
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment:
      name: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
    steps:
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/deploy-cloudrun@v2
        with:
          service: my-api-${{ github.ref == 'refs/heads/main' && 'prod' || 'staging' }}
          region: us-central1
          image: ghcr.io/${{ github.repository }}:${{ github.sha }}

The concurrency block is worth noting — it cancels in-progress runs for the same branch on pull requests, so pushing a fix while CI is still running does not queue up two pipelines.

What I Would Do Differently Starting Over

If I were setting up a pipeline from scratch today, I would start with the validation and test stages only — no deployment automation. Ship manually (or with a simple deploy script) for the first few weeks while the codebase stabilizes. Add deployment automation once the manual process becomes the bottleneck, not before. Premature pipeline optimization is just as real as premature code optimization, and debugging a broken deployment pipeline at 2 AM is significantly worse than running ./deploy.sh by hand.

Related Projects

RestoHub

Restaurants stop losing 30% to Uber Eats — they get their own ordering, menu, website, and loyalty system in one platform. Full Uber Eats-style experience, but the restaurant keeps every dollar.

TakeCare

One nurse now monitors 250 patients remotely — replacing manual phone calls and home visits across Quebec's largest hospitals. Live in Jewish General, CHUM, and Douglas Mental Health Institute.

Danil Ulmashev

Full Stack Developer

Need a senior developer to build something like this for your business?

Book a discovery call Work