In today's digital era, search engines have become an indispensable tool for individuals to access information on almost any topic effortlessly and promptly.
This article provides a step-by-step guide to building a search engine in Python using text embeddings. Text embeddings encode text as numerical vectors, so that the similarity between two pieces of text can be measured.
By following this tutorial, you will be able to build your own search API using Eden AI embeddings and serve it with Flask.
First, prepare your dataset. For the purpose of this tutorial, we'll use a dataset of 40 AI features, where each feature is described by a short text description. The dataset can be in any format, but for simplicity, we'll use a CSV file. Here's a sample of what the dataset might look like:
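As a rough illustration of the CSV format, the sketch below loads a tiny dataset with Python's built-in csv module. The two rows and the "Name" column are placeholders invented for this example, not the tutorial's actual 40-feature dataset; only the "Description" column name comes from the article.

```python
import csv
import io

# Placeholder rows invented for illustration only; they are NOT the tutorial's
# real dataset. Only the "Description" column name comes from the article.
SAMPLE_CSV = """Name,Description
receipt_parser,Extract structured fields such as totals and dates from receipt documents.
sentiment_analysis,Detect whether a piece of text is positive or negative.
"""


def load_features(csv_file):
    """Read the dataset into a list of dicts, one dict per feature."""
    return list(csv.DictReader(csv_file))


# In a real project you would pass open("dataset.csv", newline="") instead.
features = load_features(io.StringIO(SAMPLE_CSV))
for row in features:
    print(row["Name"], "->", row["Description"])
```

A pandas DataFrame would work just as well here; the standard library is used only to keep the sketch dependency-free.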
Eden AI offers an extensive collection of APIs, including pre-trained embeddings. For this tutorial, we will use OpenAI to transform the text descriptions into numerical representations. The same approach works with other providers, such as Cohere (also available on Eden AI). However, bear in mind that you should use a single provider for all of your embeddings: vectors produced by different providers live in different vector spaces, so they cannot be meaningfully compared or merged.
The code for this step applies Eden AI's embeddings API to each row of the "Description" column in the dataset and saves the resulting embeddings in a new column called "description-embeddings".
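A minimal sketch of this embedding step is shown below. It is not the article's original code: the endpoint URL, the request payload and the response shape are assumptions based on Eden AI's public API and should be checked against the current Eden AI documentation. The helper names (post_json, embed_texts, add_embeddings) are hypothetical, and the sketch sends all descriptions in one request rather than one request per row.

```python
import json
import urllib.request

# Assumed endpoint; verify against the Eden AI documentation.
EDENAI_URL = "https://api.edenai.run/v2/text/embeddings"
API_KEY = "<YOUR API KEY>"


def post_json(url, payload, api_key):
    """Send a JSON POST request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def embed_texts(texts, provider="openai", send=post_json):
    """Return one embedding (a list of floats) per input text."""
    payload = {"providers": provider, "texts": list(texts)}
    result = send(EDENAI_URL, payload, API_KEY)
    # Assumed response shape: {provider: {"items": [{"embedding": [...]}, ...]}}
    return [item["embedding"] for item in result[provider]["items"]]


def add_embeddings(rows, send=post_json):
    """Store each row's embedding under the "description-embeddings" key."""
    embeddings = embed_texts([row["Description"] for row in rows], send=send)
    for row, embedding in zip(rows, embeddings):
        row["description-embeddings"] = embedding
    return rows
```

The send parameter exists only so the HTTP call can be swapped for a stub when testing offline; in normal use you would call add_embeddings(rows) with the default.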
NOTE: Don’t forget to replace <YOUR API KEY> with your actual Eden AI API key.
Now that we have the embeddings for each description in our dataset, we can build a REST API with Flask that lets users search for features in the dataset based on their query.
First, you'll need to create a virtual environment for your Flask project and install the required dependencies. Complete the following steps to do so:
1. Open up your command line interface and navigate to the directory where you want to create your Flask project.
2. Create a new virtual environment using the command python3 -m venv <name of environment>. Replace <name of environment> with a name of your choice. This creates a new virtual environment with its own Python interpreter and installed packages, separate from your system's Python installation.
3. Activate the virtual environment by running the command source <name of environment>/bin/activate.
4. Install the required dependencies for the project by running the following commands:
5. Add the dataset to your project and create two Python files: search.py, which will contain the search logic, and app.py, which will define the REST API. This is the structure of our project:
6. In the file "search.py", we will implement the process of calling the Eden AI embeddings API (which was demonstrated earlier), as well as calculating the cosine similarity.
7. Our objective is to convert the user's search query into embeddings, and subsequently read the dataset to measure the cosine similarity between the query embeddings and the subfeature description embeddings. Our output will be a sorted list of the subfeatures based on their similarity scores, starting with the most similar:
8. In app.py, we will create an instance of our Flask application and define an endpoint for searching subfeatures in our dataset. This endpoint will call the search_subfeature() function, which takes a description (query) as input.
9. Finally, start the Flask app by running the command flask --app app.py --debug run.
If everything runs successfully, you should see a message in your command line interface that says "Running on http://127.0.0.1:5000/".
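The search logic described in steps 6 and 7 can be sketched as follows. This is a simplified illustration rather than the repository's actual search.py: the rows and embed parameters are hypothetical additions that keep the example self-contained, whereas the tutorial's search_subfeature() takes only the description. The two-dimensional vectors are toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_subfeature(description, rows, embed):
    """Rank rows by similarity between the query and each stored embedding."""
    query_embedding = embed(description)
    scored = [
        {
            "Name": row["Name"],
            "score": cosine_similarity(query_embedding, row["description-embeddings"]),
        }
        for row in rows
    ]
    # Most similar subfeature first.
    return sorted(scored, key=lambda item: item["score"], reverse=True)


# Toy data: real embeddings have hundreds or thousands of dimensions.
rows = [
    {"Name": "sentiment_analysis", "description-embeddings": [0.0, 1.0]},
    {"Name": "receipt_parser", "description-embeddings": [1.0, 0.0]},
]
toy_embed = lambda text: [0.9, 0.1]  # stand-in for a call to the embeddings API

results = search_subfeature("How can I extract data from a receipt document?", rows, toy_embed)
print(results[0]["Name"])  # prints "receipt_parser"
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.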
To test our search engine API, we can send a GET request to the http://127.0.0.1:5000/search endpoint with the query as the payload. To do this, we can use a tool like Postman, which makes it easy to send HTTP requests and view the responses.
The most relevant result for the query "How can I extract data from a receipt document?" is indeed the receipt_parser subfeature.
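If you prefer to test from Python rather than Postman, a request along these lines could be used. This is only a sketch: it assumes the endpoint reads the query from a query-string parameter named description and returns JSON, neither of which is confirmed by the article, so adjust the parameter name and transport to match your app.py.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:5000/search"


def build_search_url(query, param="description"):
    """Build the GET URL; the parameter name is an assumption, not from the article."""
    return BASE_URL + "?" + urllib.parse.urlencode({param: query})


def search(query):
    """Send the GET request to the running Flask app and decode the JSON reply."""
    with urllib.request.urlopen(build_search_url(query)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call search(...) once the Flask app is running.
    print(build_search_url("How can I extract data from a receipt document?"))
```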
You can access the full code in this GitHub repository: https://github.com/Daggx/embedding-search-engine