Semantic search is a search technique that goes beyond simple keyword matching to provide more accurate and meaningful results, with the help of natural language processing and machine learning. It attempts to understand the intent behind the user's query and the context in which it is made. By analyzing the relationships between words, phrases, and concepts, it can surface the most relevant and useful information for the user, even when that information doesn't contain the exact search terms used.
Let's say you're searching for information on climate change. With a basic keyword search, the search engine would look for documents that contain those exact keywords. The results may include documents that use a combination or variation of those words, such as "temperature change" or "climate crisis". However, with semantic search, the search engine can understand concepts related to climate change and provide more relevant results. For example, it might include pages that discuss carbon emissions, renewable energy, or the Paris Agreement, as these concepts are closely related to the topic of climate change.
In this blog post, we'll explore some of the concepts that make semantic search possible. We'll then build a full-stack application that utilizes semantic search to provide highly relevant search results for a fictitious e-commerce website.
The application we'll be building today.
If you're only interested in the source code, you can find the full project on GitHub.
Fundamentals
Embeddings
At its core, an embedding is a vector (list) of floating point numbers that represent a piece of information, such as a document. The distance between two vectors measures how related they are, with a smaller distance indicating higher similarity and a larger distance indicating lower similarity. This is calculated using mathematical distance functions that operate on vectors, such as cosine similarity.
For embeddings of textual information, the relatedness between words and phrases is measured in a way that is mostly independent of their spelling and pronunciation, and instead relies on their semantics.
Consider the words "cat" and "dog." Though these words are spelled and sound different, they share a commonality: they are both animals that are typically kept as pets. By using embeddings, semantic search can understand that these two words are related in meaning and can provide search results that take this relationship into account.
The smaller the distance, the higher the similarity.
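To make this concrete, here's a minimal sketch of cosine similarity in plain Python. The vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: the dot product of the vectors divided by
    # the product of their magnitudes. Values near 1.0 mean the vectors
    # point in the same direction (related); values near 0.0 mean
    # they are orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" -- entirely made up for this example.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, dog))  # high: related concepts
print(cosine_similarity(cat, car))  # low: unrelated concepts
```

Note that "cosine distance" is simply 1 minus the cosine similarity, which is why smaller distances mean higher similarity.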
Generating embeddings (AKA vectors) that capture the relationships between words requires significant computational resources to train machine learning models on large amounts of data. However, there are pre-trained models available, which allow us to skip the training step entirely. OpenAI's Embeddings API is one such example: given a document, it spits out an embedding. By utilizing this service, we can save a ton of time and resources that would otherwise be spent on setting up and training machine learning models.
Vector databases
To effectively manage and query embeddings, a vector database is key. These databases specialize in storing embeddings and providing optimized vector querying, and are utilized in various applications, including semantic search, image search, and recommender systems. While several vector database technologies exist, in this article we'll be using an open-source one called Chroma. It has an in-memory version which makes it ideal to get our feet wet with.
Putting it into practice
Time to get to work. We need to build search functionality for an e-commerce website that sells all sorts of fitness goodies like workout gear and gym equipment. There should be a search bar that allows users to search an underlying collection of products using semantic search. The search results should be sorted by relevance, with the most relevant products appearing first.
Setting up the database
First things first, we'll need some sample data. I put ChatGPT to work, asking it for 50 real fitness products in JSON format. The AI was happy to oblige, providing a list complete with descriptions, brands, and prices:
[
  {
    "id": 1,
    "description": "Fitbit Charge 5 Advanced Fitness Tracker",
    "brand": "Fitbit",
    "price": "$179.95"
  }
  // ... 49 more products omitted
]
Expand to see the ChatGPT prompt
Generate a JSON list of 50 fitness products.
The products should be products that exist in the real world.
Each product should contain the following fields:
- id (integer)
- description (string)
- brand (string)
- price (string in USD format)
With these products conveniently saved to a local products.json file, we can now generate embeddings from the description of each product. We'll be using the OpenAI Python client to do this, which needs an OpenAI API key. If you don't have one, you can get one here. Usage of the API is not free, but it's pretty cheap.
import os
import json

import openai

openai.api_key = os.getenv('OPENAI_API_KEY')

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text],
                                   model=model)['data'][0]['embedding']

with open('products.json', 'r') as f:
    products = json.load(f)

embeddings = []
for product in products:
    embedding = get_embedding(product['description'])
    embeddings.append(embedding)
We now have two lists with the same number of elements, products and embeddings. The product at index i of the products list has its corresponding embedding at index i of the embeddings list.
Let's store all the products and the embeddings in Chroma, so that we can query them later. The first step is to create a Chroma collection, which can be thought of as a table in a relational database:
import chromadb
client = chromadb.Client()
collection = client.create_collection(name="products")
To add documents to a Chroma collection, we use the .add() method, which accepts a list of ids, documents, embeddings, and metadata. If we provide a list of documents but no embeddings, Chroma will tokenize and embed the documents using the collection's embedding function (specified during collection creation).
Alternatively, we can provide a list of pre-computed embeddings but no documents, and use the document ids to correlate with documents stored elsewhere (consider a scenario where the documents live in a traditional database, the embeddings in a vector database, and ids are used to reconcile between the two). In our case, however, we'll supply both the documents and the embeddings for convenience, so that all data is available in a single location and querying is simpler.
Any metadata that is provided can be used for additional filtering capabilities during querying, which we'll see an example of later on.
ids = [str(p['id']) for p in products]  # Chroma requires ids to be strings
documents = [p['description'] for p in products]
metadata = [{'brand': p['brand'], 'price': p['price']} for p in products]

collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadata,
    embeddings=embeddings)
In order to query the collection, we use the query() method with the query_embeddings argument, which takes an embedding of the search query:
query = "heartrate monitor"
embedding = get_embedding(query)
results = collection.query(query_embeddings=[embedding], n_results=2)
The result is a dictionary containing the distances, documents, ids, and metadata of the most relevant products, sorted by most relevant first:
{
"products": {
"distances": [[0.26775068044662476, 0.3283766508102417]],
"documents": [
[
"Fitbit Alta HR Fitness Tracker",
"Fitbit Charge 5 Advanced Fitness Tracker"
]
],
"ids": [["18", "1"]],
"metadatas": [
[
{
"brand": "Fitbit",
"price": "$129.95"
},
{
"brand": "Fitbit",
"price": "$179.95"
}
]
]
}
}
Take a moment to appreciate the magic here: the search was able to understand that "heartrate monitor" is similar to "fitness tracker", even though the product descriptions don't contain any of the search query's keywords.
With a functioning search engine, let's move on to building out the rest of the application (disclaimer: less AI goodness ahead).
Defining the API
We'll use Flask to build our API. It will have a single GET endpoint (/products) that allows us to perform a semantic search over our products collection. We'll also add an optional brand parameter that allows us to filter by it.
To keep our code organized and maintainable, we'll first create a Collection class that will contain all of our database operations and embedding needs:
import chromadb
import openai


class Collection:
    def __init__(self, openai_api_key):
        self.client = chromadb.Client()
        self.collection = self.client.create_collection(name="products",
                                                        get_or_create=True)
        openai.api_key = openai_api_key

    def add(self, products):
        embeddings = [Collection.__get_embedding(p['description'])
                      for p in products]
        ids = [str(p['id']) for p in products]
        documents = [p['description'] for p in products]
        metadata = [{'brand': p['brand'], 'price': p['price']}
                    for p in products]
        self.collection.add(ids=ids,
                            documents=documents,
                            metadatas=metadata,
                            embeddings=embeddings)

    def get(self, query, n_results=2, brand=None):
        embedding = Collection.__get_embedding(query)
        where = {'brand': brand} if brand else {}
        result = self.collection.query(query_embeddings=[embedding],
                                       n_results=n_results, where=where)
        return Collection.__convert_to_products(result)

    @staticmethod
    def __get_embedding(text, model="text-embedding-ada-002"):
        return openai.Embedding.create(input=[text],
                                       model=model)['data'][0]['embedding']

    @staticmethod
    def __convert_to_products(result):
        num_products = len(result['ids'][0])
        products = [
            {
                'id': result['ids'][0][i],
                'description': result['documents'][0][i],
                'brand': result['metadatas'][0][i]['brand'],
                'price': result['metadatas'][0][i]['price'],
                'distance': result['distances'][0][i]
            } for i in range(num_products)
        ]
        return products
We've already seen most of the code above, apart from some additions: a where argument to filter the collection results by brand and a helper method that converts the results from the Chroma collection into a format that is more user-friendly.
With our database logic encapsulated, let's build our Flask server:
import os

from flask import Flask, request, jsonify

from products import Collection

openai_api_key = os.getenv('OPENAI_API_KEY')
collection = Collection(openai_api_key)
app = Flask(__name__)


@app.route('/products', methods=['GET'])
def products():
    try:
        brand = request.args.get('brand')
        query = request.args.get('query')
        results = collection.get(query=query, brand=brand) if query else []
        return jsonify({'products': results})
    except Exception as e:
        return jsonify({'error': str(e)}), 500


if __name__ == '__main__':
    app.run()
For brevity, I omitted the code that seeds the database with the initial set of products.
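For reference, the seeding step can be sketched roughly like this. The seed helper is hypothetical (it's not part of the project's code as shown), and it assumes the products.json file from earlier plus any object exposing the add() method of our Collection class:

```python
import json

def seed(collection, path='products.json'):
    # Load the sample products from disk and add them (along with
    # their embeddings, computed inside Collection.add) in one go.
    with open(path) as f:
        products = json.load(f)
    collection.add(products)
    return len(products)
```

In the actual server, we'd call something like seed(collection) once at startup, before the first request comes in.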
Now, let's start the server:
$ flask run
* Debug mode: off
* Running on http://127.0.0.1:5000
And let's shop for some trainers:
$ curl -X GET 'http://127.0.0.1:5000/products?query=trainers'
{
"products": [
{
"brand": "Nike",
"description": "Nike Free Metcon 5 Training Shoes",
"distance": 0.2830010950565338,
"id": "2",
"price": "$120.00"
},
{
"brand": "Reebok",
"description": "Reebok Women's CrossFit Nano 9 Training Shoes",
"distance": 0.3306140601634979,
"id": "6",
"price": "$130.00"
},
{
"brand": "TRX",
"description": "TRX Suspension Trainer",
"distance": 0.3345400094985962,
"id": "8",
"price": "$169.95"
}
]
}
Let's try filtering by Nike:
$ curl -X GET 'http://127.0.0.1:5000/products?query=trainers&brand=Nike'
{
"products": [
{
"brand": "Nike",
"description": "Nike Free Metcon 5 Training Shoes",
"distance": 0.2830010950565338,
"id": "2",
"price": "$120.00"
},
{
"brand": "Nike",
"description": "Nike Men's Dry Training Pants",
"distance": 0.33897754549980164,
"id": "30",
"price": "$55.00"
},
{
"brand": "Nike",
"description": "Nike Women's Dry Tempo Running Shorts",
"distance": 0.4058068096637726,
"id": "48",
"price": "$30.00"
}
]
}
The Metcon 5s are impressive shoes but our search API is equally awe-inspiring. And the best part? We managed to build it in just under 50 lines of code!
We could have saved ourselves additional time and effort by using the Chroma OpenAI embeddings function to generate the embeddings automatically when inserting documents into Chroma. However, by generating the embeddings ourselves (and by that I mean calling the OpenAI API ourselves), we were able to gain a familiarity with the OpenAI API and a better understanding of how embeddings work.
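For the curious, that alternative looks roughly like this. This is a sketch based on Chroma's bundled OpenAI embedding function, not the approach used in this post, and the exact module layout may vary between Chroma versions:

```python
import os

import chromadb
from chromadb.utils import embedding_functions

# Let Chroma call the OpenAI API for us whenever documents are added
# or queried, instead of computing the embeddings ourselves.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv('OPENAI_API_KEY'),
    model_name="text-embedding-ada-002")

client = chromadb.Client()
collection = client.create_collection(name="products",
                                      embedding_function=openai_ef)

# No embeddings argument needed: Chroma embeds the documents itself.
collection.add(ids=["1"],
               documents=["Fitbit Charge 5 Advanced Fitness Tracker"])
```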
With that said, it's time to build the frontend!
Building the user interface
Our frontend will be a simple HTML page, consisting of a search bar and a list of results based on the search query. To hold application state, we'll be using Alpine.js - a lightweight JavaScript framework that allows us to create reactive components with ease, without the need for more complex frameworks like React or Vue. (On a sidenote, Alpine.js is great for small-to-medium-sized projects, and it's remarkable how much you can accomplish with it alone. In fact, I'm currently using it in my latest application, Limón, with great success.)
To start, we'll create a script tag in our HTML page that initializes an Alpine data component called search, which holds the application state and functionality. Normally we would put this in its own JavaScript file, but for the sake of simplicity, we'll keep it in the HTML page:
<script>
  document.addEventListener('alpine:init', () => {
    Alpine.data('search', () => ({
      query: '',
      products: [],
      searchProducts() {
        fetch('/products?query=' + this.query)
          .then((res) => {
            return res.json()
          })
          .then((data) => {
            this.products = data['products']
          })
          .catch((err) => {
            console.log(err)
          })
      },
    }))
  })
</script>
The query field will be bound to the value of the search bar input field, and products will store the results of the query to the backend, which we'll then use to render the products on the page. We've also defined a searchProducts() function that will make a request to the backend and update products with the search results. Alpine takes care of the reactivity, so the UI will automatically update as the state changes (and vice versa).
In order to use the search component that we just defined, we need to use the x-data directive in our HTML. We'll add this to a parent div, so that all child elements can access the component's state and functions:
<body>
  <div class="container">
    <h1>Fitness store</h1>
    <div x-data="search">
    </div>
  </div>
  <script>
    // ... the Alpine data component defined earlier
  </script>
</body>
Now that the component is wired up to our HTML, we need to add the search bar and the list of results. Let's begin by adding the search bar:
<div x-data="search">
  <input
    class="search-input"
    type="text"
    placeholder="Search for any product..."
    x-model="query"
    @input="searchProducts" />
</div>
The x-model directive binds the value of the search bar input element to the query field in our component, automatically updating it as the user types. Moreover, the @input attribute makes any input events on the element trigger the searchProducts() method. This makes a request to the backend, which updates the products field in our component, and in turn Alpine automatically re-renders the UI with the new products.
However, this can potentially result in a large number of requests to the backend, since we're making a request each time the user types something. We can improve this by debouncing the input events. Debouncing delays the triggering of the function until the user pauses typing for a specific duration, reducing the number of requests sent to the backend. Fortunately, Alpine provides a straightforward way to debounce events by adding a debounce modifier to the @input event listener:
@input.debounce="searchProducts"
This adds a 250ms debouncing delay (which can be changed, if desired).
With the search bar complete, let's add the code that renders the list of products:
<div x-data="search">
  <input ... />
  <ul class="product-container">
    <template x-for="product in products">
      <li class="product-item">
        <div class="brand" x-text="product.brand"></div>
        <h2 class="description" x-text="product.description"></h2>
        <div class="price" x-text="product.price"></div>
      </li>
    </template>
  </ul>
</div>
We're using the x-for directive to iterate through the products in our component. For each product, we render a list item that displays the product's brand, description, and price, using the x-text directive to bind each property to its element. The template tag is required by Alpine to enable the use of the x-for directive; it's also a special HTML tag whose contents are not rendered directly to the user.
Let's not forget to add some styling to make the page look a bit nicer:
Expand to see the CSS
body {
  font-family: 'Verdana';
}

ul {
  padding-left: 0;
  list-style-type: none;
}

.container {
  margin: 30px auto 30px auto;
  max-width: 48rem;
  padding: 0 1rem 0 1rem;
}

.search-input {
  height: 32px;
  font-size: 20px;
  width: 100%;
}

.product-container {
  display: flex;
  flex-direction: column;
  margin: 1rem auto 0 auto;
}

.product-item {
  border: 1px solid #ccc;
  border-radius: 0.5rem;
  padding: 10px;
  margin-top: 10px;
}

.product-item h2 {
  font-size: 20px;
  margin: 0;
}

.product-item .brand {
  color: #5a5a5a;
}

.product-item .description {
  margin: 10px 0 10px 0;
}

.product-item .price {
  color: green;
  float: right;
  font-weight: bold;
}
Finally, let's add a handler to our Flask server that will render the HTML page:
from flask import send_from_directory

@app.route('/')
def index():
    return send_from_directory('public', 'index.html')
Let's start the server:
$ flask run
* Debug mode: off
* Running on http://127.0.0.1:5000
And visit 127.0.0.1:5000 in our browser:
Neat, right? If you'd like to see the code in its entirety, you can find it all on GitHub.
Conclusion
Semantic search is a powerful tool that can significantly enhance the user experience in our applications. What's even more exciting is that implementing it has never been easier. By leveraging a few powerful technologies, we were able to build an application powered by semantic search with just 100 lines of code, end-to-end. And this is just scratching the surface.
OpenAI provides a vast selection of APIs that go beyond embeddings. Chroma has richer querying features that we didn't explore in this post, and higher-level wrappers in LangChain offer even more functionality.
As new AI tooling and technologies continue to emerge, software engineers can expect to be spoilt for choice. It's an exciting time to be in this field, and the possibilities are endless. Make the most of it!