
Integrating cloud-based vector databases with CI/CD pipelines

What is CI/CD?

Continuous Integration (CI) and Continuous Delivery (CD) are software development best practices. CI automates integrating code changes from multiple contributors into a single software project, providing rapid feedback via unit and integration tests and early detection of issues.

A continuous integration workflow

CD extends this automation, promoting code through all stages of the production pipeline to enable quick and reliable releases.

A continuous delivery workflow


These practices enhance developer productivity and satisfaction and decrease the number of defects in production.

This chapter offers practical recommendations for managing CI/CD pipelines with cloud-based vector databases like Pinecone.

Challenges testing cloud-based managed services in CI/CD pipelines

Several challenges arise when testing cloud-based services in CI/CD pipelines:

  • Keeping test jobs that originate from the same monorepo isolated from one another.
  • Testing non-local services that cannot be containerized.
  • Ensuring that tests running concurrently in a busy environment do not interfere with each other by reading from or writing to a common data source.

We’ll consider each of these challenges in turn, alongside its solution, using our open-source example repository that smoothly manages CI/CD with Pinecone’s cloud-native vector database. You can fork, clone, and use our starting point and GitHub Actions workflows to quickly set up a similar pipeline for your organization.

This chapter and its companion repository use GitHub, but the same principles we’re examining here apply to other version control and CI/CD systems.

Managing monorepo complexity in CI/CD

Challenge: The monorepo pattern complicates development by multiple teams, leading to potential conflicts at runtime

A common scenario complicating CI/CD tasks is the monorepo owned by several development teams. In our example repository, imagine that team A is responsible for the recommendation service, while team B is responsible for the question_answer service.
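At a high level, the example monorepo is laid out roughly like this (the question-answer workflow filename shown here is an assumption; only the recommendation workflow file is named explicitly later in this chapter):

.github/
  workflows/
    github-actions-recommendation.yaml    # CI/CD for team A's service
    github-actions-question-answer.yaml   # assumed filename for team B's workflow
recommendation/                           # team A's service code and tests
question_answer/                          # team B's service code and tests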

The monorepo pattern has a number of advantages, but it complicates concurrent development by multiple teams, as overlapping CI/CD processes can interfere with each other.

Working with a typical monorepo setup


Each team's updates can trigger CI/CD processes that need to be isolated yet efficient.

Solution: Create separate GitHub Actions workflows triggered by changes in specific file paths

We define CI/CD processes that trigger based on changes in specific directories. This ensures that only relevant tests and deployments are triggered, preventing unnecessary builds and conserving resources.

In our sample repository, the .github/workflows/github-actions-recommendation.yaml file specifies the recommendation subdirectory in its paths array, telling the GitHub Actions service to run this workflow only when files under that path change:

name: CI/CD - Recommendation
run-name: Recommendation - ${{ github.actor }} - ${{ github.event_name }} - ${{ github.sha }}
on: 
  push:
    paths:
      - 'recommendation/**'


... remainder of workflow follows ...

We do the same thing for the question_answer CI/CD workflow in its respective file: the Question-Answer workflow runs whenever the question_answer directory changes.

name: CI/CD - Question-Answer
run-name: Question-Answer - ${{ github.actor }} - ${{ github.event_name }} - ${{ github.sha }}
on: 
  push:
    paths:
      - 'question_answer/**'

... remainder of workflow follows ...

Each development team now triggers only its own jobs when working within its area of responsibility.

Effective testing of cloud-based services

Challenge: Non-local cloud services prevent traditional mocking and containerization strategies

Managed cloud services offer many advantages, including ease of use, reliability, and ongoing security maintenance.

Everything in engineering is a trade-off, and one of the difficulties of using managed cloud services is that you cannot easily replicate the service with a container or software mocks in your pipeline.

In a traditional CI/CD context, the fact that Team A’s job is running in a container is the boundary that provides strong isolation guarantees. How can we accomplish something similar when every team needs to hit the same cloud API?

Solution: Dynamically provision resources for testing, and scope jobs to specific resources

In a typical CI/CD environment, it’s common to see multiple concurrent test runs, especially when the repository triggers tests for every pull request, every new commit on a branch, and every merge to main or a release branch.

We can create the cloud resources necessary for testing at the beginning of our test runs. In our companion repo, the setup.py script (https://github.com/pinecone-field/cicd-demo/blob/main/setup.py) demonstrates how to provision four Pinecone indexes that will serve as our testing resources:

  • question-answer-ci
  • recommendation-ci
  • question-answer-production
  • recommendation-answer-production

You can run this script manually when setting up your project, or programmatically as part of your setup workflow. We’ll use these four indexes in the tests that follow, allowing separate teams to read and write to them as needed and providing a first level of isolation between tests.
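For illustration, here is a minimal sketch of how such a provisioning script might create the CI indexes using the Pinecone Python SDK. The dimension, metric, and cloud/region settings below are assumptions for the example, not necessarily what setup.py uses:

import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Assumed settings; adjust to match your own setup script
for index_name in ["question-answer-ci", "recommendation-ci"]:
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=1536,  # assumed embedding dimension
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )

The production indexes can be created the same way, or managed through Infrastructure as Code as described later in this chapter.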

Note that each of the workflows above sets its own value for the Pinecone index it uses: PINECONE_INDEX_RECOMMENDATION and PINECONE_INDEX_QUESTION_ANSWER, respectively.

To further scope down CI/CD jobs so they don’t interfere with partner teams’ jobs, we can use environment variables to restrict each job to its designated resources.

The CI/CD jobs set two environment variables so that all unit tests run in a dedicated CI index under a unique namespace in Pinecone, keeping each job’s runtime and test data isolated.

In addition, each workflow creates a brand-new Pinecone namespace within its respective index, named after the commit SHA of the current run, ensuring isolation from past and future runs:

PINECONE_NAMESPACE: ${{ github.sha }}

Here’s what this flow looks like at runtime, when a developer pushes a new commit:

Separating workflows with discrete file path targets

When code is committed to the “question_answer” sub-folder, the “CI/CD - Question-Answer” job executes in its private index and within a new namespace unique to that run.

When code is committed to the “recommendation” sub-folder, the “CI/CD - Recommendation” job executes in the same way.

Once the namespace is allocated, the job upserts a sample set of vectors to test against.
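As a rough sketch of what that upsert step might look like with the Pinecone Python SDK (the vector IDs, values, and dimension below are placeholders rather than the repo’s actual test data):

import os
from pinecone import Pinecone

# The workflow supplies these values; PINECONE_NAMESPACE is the commit SHA
index_name = os.environ["PINECONE_INDEX_QUESTION_ANSWER"]
namespace = os.environ["PINECONE_NAMESPACE"]

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(index_name)

# Placeholder sample vectors; the real pipeline upserts embeddings of test documents
sample_vectors = [
    ("test-doc-1", [0.1] * 1536, {"source": "ci"}),
    ("test-doc-2", [0.2] * 1536, {"source": "ci"}),
]
index.upsert(vectors=sample_vectors, namespace=namespace)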

Optimizing resource management and cost control

Challenge: Failing to clean up cloud resources leads to higher cloud bills at the end of the month

Solution: Tear down test resources at the end of every run

When it’s time to tear down, the namespace is cleaned up and deleted:

jobs:
  Build-RunDataPipeline-RunTests-Teardown:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run data pipeline - upsert
        run: python question_answer/data_pipeline.py upsert
      - name: Run question_answer tests
        run: python -m unittest question_answer.test
      - name: Run data pipeline - delete
        if: always()
        run: python question_answer/data_pipeline.py delete


Note that each workflow’s teardown step is guarded with `if: always()`, meaning it executes at the end of every test run, even if earlier steps fail. It calls the delete method within the data_pipeline.py file, which uses the same environment variables from earlier steps to clean up the test namespace.

This teardown process addresses our cost requirement as well. By ensuring that test resources are cleaned up at the end of every test run, we decrease the chances of a surprise on our cloud bill.
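A minimal sketch of what that delete path inside data_pipeline.py might look like, assuming the Pinecone Python SDK and the same environment variables (the function name here is illustrative):

import os
from pinecone import Pinecone

def delete_test_namespace():
    """Remove every vector written by this CI run's commit-scoped namespace."""
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index(os.environ["PINECONE_INDEX_QUESTION_ANSWER"])
    # Deleting all vectors in the namespace cleans up this run's test data
    index.delete(delete_all=True, namespace=os.environ["PINECONE_NAMESPACE"])

if __name__ == "__main__":
    delete_test_namespace()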

Deploying Pinecone with Infrastructure as Code

Infrastructure as Code (IaC) is a programming paradigm in which you describe your desired cloud infrastructure declaratively. IaC tools read this declaration and reconcile your desired state with the actual state via cloud provider APIs, producing reproducible deployments across teams.

Pinecone supports two of the most popular IaC tools, Terraform and Pulumi. In this section, we provide tips for using Infrastructure as Code tools with Pinecone in your CI/CD workflows.

Environment Variables

Integrating IaC with CI/CD tools is often facilitated by using environment variables and custom workflows, providing a dynamic and flexible approach to managing complex environments.

Environment variables are essential for customizing the behavior of IaC scripts across the stages of a CI/CD pipeline. By leveraging environment variables, teams can dynamically adjust infrastructure configurations such as server sizes, region settings, and feature flags without altering the core IaC scripts.

Consider the following Terraform configuration, which declares variables alongside a provider and a resource:

variable "instance_size" {
  type    = string
  default = "t2.micro"
}

variable "region" {
  type    = string
  default = "us-west-1"
}

provider "aws" {
  region = var.region
}

resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = var.instance_size
}

In a CI/CD environment, you can override these defaults by setting the corresponding TF_VAR_-prefixed environment variables:

export TF_VAR_instance_size="t2.large"
export TF_VAR_region="us-east-1"
terraform apply


This approach decouples the configuration from the code, allowing for safer and more predictable changes that can be managed and reviewed through version control systems.

Custom Workflows with GitHub Actions

CI/CD tools allow the creation of custom workflows that can include steps to initialize, plan, apply, and destroy infrastructure managed by IaC tools.

name: Terraform CI/CD

on:
  push:
    branches:
      - main
      - develop
  pull_request:
    branches:
      - main
      - develop

jobs:
  terraform:
    name: Terraform
    runs-on: ubuntu-latest

    # Environment variables can be set here
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      TF_VAR_instance_size: "t2.micro"
      TF_VAR_region: "us-west-1"

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v1
        with:
          terraform_version: "1.0.0"

      - name: Terraform Init
        id: init
        run: terraform init

      - name: Terraform Plan
        id: plan
        run: terraform plan -out=tfplan

      # Serialize runs on main so that concurrent pushes don't apply over each other
      - name: Wait for other workflow runs to finish
        if: github.ref == 'refs/heads/main'
        uses: softprops/turnstyle@v1
        with:
          poll-interval-seconds: 10
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve tfplan

      # Only runs if the 'delete' event is also added to the 'on:' triggers above
      - name: Terraform Destroy
        if: github.event_name == 'delete'
        run: terraform destroy -auto-approve

You can configure these workflows to trigger based on specific conditions such as branch pushes, pull requests, or manual approvals, ensuring that infrastructure changes are integrated into the software development lifecycle in a controlled and traceable manner.

GitHub Actions provides a powerful platform for implementing CI/CD pipelines directly within your GitHub repositories. When paired with IaC tools like Terraform and Pulumi, Actions can automate the provisioning and management of infrastructure alongside application code, making it an ideal reference setup for CI/CD workflows.

GitLab offers similar functionality, and there are other popular CI/CD workflow technologies, including Jenkins, Spinnaker, ArgoCD, and CircleCI. The concepts in this chapter apply to any CI/CD tool.

Terraform

As we showed above, CI/CD tools can be paired with Terraform to manage infrastructure.

A typical workflow might include steps to initialize the Terraform configuration, create an execution plan, apply the plan to provision resources, and provide feedback on the deployment process. You can find more about Pinecone's Terraform provider at https://docs.pinecone.io/integrations/terraform.

Pulumi

Pulumi takes a different approach by using general-purpose programming languages like JavaScript, Python, or Go to define infrastructure:

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Load environment variables
const instanceSize = process.env.INSTANCE_SIZE || "t2.micro";
const region = process.env.AWS_REGION || "us-west-1";

// Set the AWS region
const provider = new aws.Provider("aws-provider", {
    region: region,
});

// Create an AWS instance
const instance = new aws.ec2.Instance("web-server", {
    instanceType: instanceSize,
    ami: "ami-0c55b159cbfafe1f0",
}, { provider: provider });

export const instanceId = instance.id;
export const instanceRegion = provider.region;

CI/CD tools can run Pulumi scripts to deploy infrastructure and then handle application code deployment. A typical Pulumi pipeline includes:

  • Previewing changes before they are applied.
  • Updating resources to match the desired state.
  • Running integration tests to ensure deployments meet the desired state as defined in the code.
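In a CI/CD job, the first two of those steps typically map onto the Pulumi CLI; a rough sketch (stack selection and authentication omitted):

pulumi preview   # show the planned infrastructure changes for review
pulumi up --yes  # apply the changes non-interactively in CI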

You can find out more about Pinecone's Pulumi provider at https://docs.pinecone.io/integrations/pulumi.

