Making LLMs smarter with Extensible Knowledge Access
This guide shows how to use Retrieval Augmented Generation (RAG) to enhance a large language model (LLM). RAG lets an LLM reference context outside of its original training data before generating a response. Training a model for your own domain-specific purposes can be extremely expensive in both time and computing power, so RAG is a cost-effective way to extend the capabilities of an existing LLM. To demonstrate RAG in this guide, we'll give Llama 3.2 access to Nitric's documentation so that it can answer specific questions about it. You can adapt this guide to another data source that meets your needs.
Prerequisites
- uv - for Python dependency management
- The Nitric CLI
- (optional) An AWS account
Getting started
We'll start by creating a new project using Nitric's python starter template.
If you want to take a look at the finished code, it can be found here.
nitric new llama-rag py-starter
cd llama-rag
Next, let's install our base dependencies, then add the llama-index libraries. We'll be using Llama Index as it makes building RAG applications extremely simple and has support for running our own local Llama 3.2 models.
# Install the base dependencies
uv sync

# Add Llama Index dependencies
uv add llama-index llama-index-embeddings-huggingface llama-index-llms-llama-cpp --optional ml
We add the extra dependencies to the 'ml' optional dependency group to keep them separate, since they can be quite large. This lets us install them only in the containers that need them.
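For reference, after these commands the optional group in pyproject.toml will look roughly like this (uv also adds version constraints, which are omitted here):

[project.optional-dependencies]
ml = [
    "llama-index",
    "llama-index-embeddings-huggingface",
    "llama-index-llms-llama-cpp",
]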
We'll organize our project structure like so:
+-- common/
|   +-- __init__.py
|   +-- model_parameters.py
|   +-- resources.py
+-- services/
|   +-- subscriber.py
|   +-- chat.py
+-- .gitignore
+-- .python-version
+-- model.dockerfile
+-- model.dockerfile.dockerignore
+-- model_utilities.py
+-- pyproject.toml
+-- python.dockerfile
+-- python.dockerfile.dockerignore
+-- nitric.yaml
+-- README.md
Setting up our LLM
We'll define a ModelParameters class to hold the parameters used throughout our application. Wrapping them in a class means the LLM and embed model are loaded lazily, so modules that don't need them aren't slowed down by initialising everything. At this point we can also create a prompt template to use with our query engine. It reins in hallucinations, so that if the model does not know an answer it won't pretend that it does.
import os

from llama_index.core import ChatPromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


class ModelParameters:
  # Lazily loaded llm
  llm = None

  # Lazily loaded embed model
  embed_model: HuggingFaceEmbedding = None

  # Set the location that we will persist our embeds
  persist_dir = "./models/query_engine_db"

  # Set the location to cache the embed model
  embed_cache_folder = os.getenv("HF_CACHE") or "./models/vector_model_cache"

  # Set the location to store the llm
  llm_cache_folder = "./models/llm_cache"

  # Create the prompt query templates to sanitise hallucinations
  prompt_template = ChatPromptTemplate.from_messages([
    (
      "system",
      "If the context is not useful, respond with 'I'm not sure'.",
    ),
    (
      "user",
      (
        "Context information is below.\n"
        "---------------------\n"
        "{context_str}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge "
        "answer the question: {query_str}\n."
      )
    ),
  ])

  def __init__(self):
    # Lazily load the locally stored Llama model
    self._llm = None
    # Lazily load the Embed from Hugging Face model
    self._embed_model = None

  @property
  def llm(self):
    from llama_index.llms.llama_cpp import LlamaCPP

    if self._llm is None:
      print("Initializing Llama CPP Model...")
      self._llm = LlamaCPP(
        model_url=None,
        model_path=f"{self.llm_cache_folder}/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
        temperature=0.7,
        verbose=False,
      )
    return self._llm

  @property
  def embed_model(self):
    if self._embed_model is None:
      print("Initializing Embed Model...")
      self._embed_model = HuggingFaceEmbedding(
        model_name=self.embed_cache_folder,
        cache_folder=self.embed_cache_folder
      )
    return self._embed_model
Building a Query Engine
The next step is to embed our context into a vector index that the LLM can query. For this example we'll embed the Nitric documentation. It's open source on GitHub, so we can clone it into our project.
git clone https://github.com/nitrictech/docs.git nitric-docs
We'll create a script which downloads the LLM and the embed model (using a recommended model from Hugging Face), then builds the vectorised documentation using the embed model.
import os
from urllib.request import urlretrieve

from common.model_parameters import ModelParameters

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from huggingface_hub import snapshot_download


download_embed_model()
download_llm()
build_query_engine()
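The three helpers called above do the actual work and sit in model_utilities.py before those calls. A minimal sketch of how they could be implemented is below; the embed model name, the GGUF download URL, and the .mdx file filter are assumptions for illustration, so substitute the model, URL, and extensions you actually want to use.

def download_embed_model():
  # Download an embedding model from Hugging Face into the local cache folder.
  # "BAAI/bge-small-en-v1.5" is an assumed example model, not prescribed by this guide.
  snapshot_download(
    "BAAI/bge-small-en-v1.5",
    local_dir=ModelParameters.embed_cache_folder,
  )


def download_llm():
  # Fetch the quantised Llama 3.2 GGUF referenced by ModelParameters.
  # The URL is a placeholder; point it at the GGUF file you want to use.
  os.makedirs(ModelParameters.llm_cache_folder, exist_ok=True)
  urlretrieve(
    "https://huggingface.co/your-chosen-repo/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    f"{ModelParameters.llm_cache_folder}/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
  )


def build_query_engine():
  # Embed the cloned documentation and persist the vector index to disk.
  Settings.embed_model = ModelParameters().embed_model
  documents = SimpleDirectoryReader(
    "nitric-docs",
    recursive=True,
    required_exts=[".mdx"],
  ).load_data()
  index = VectorStoreIndex.from_documents(documents)
  index.storage_context.persist(persist_dir=ModelParameters.persist_dir)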
You can then run this using the following command. This should output the models and the vectorised documentation into the ./models folder.
uv run model_utilities.py
Create our resources
Let's create our resources in a common file so they can be imported by both the subscriber and chat modules. We'll create a websocket which accepts prompts from the user, and a topic to hand prompts off to the backend query engine. When the websocket receives a prompt message it publishes to the topic, which triggers the subscriber to handle the prompt. Once the subscriber has finished it sends the response back over the socket. We route through a topic so that the websocket doesn't time out after 30 seconds, as most queries will take longer than that to process.
from nitric.resources import websocket, topic

socket = websocket("socket")
chat_topic = topic("chat")
Use the resources for querying the model
With our LLM downloaded and the context documentation vectorised for querying, we can use our websocket to handle prompts. The main piece of logic here is publishing each prompt to the chat topic.
from common.resources import socket, chat_topic
from nitric.context import WebsocketContext
from nitric.application import Nitric

publishable_chat_topic = chat_topic.allow("publish")


@socket.on("connect")
async def on_connect(ctx):
  # handle connections
  print(f"socket connected with {ctx.req.connection_id}")
  return ctx


@socket.on("disconnect")
async def on_disconnect(ctx):
  # handle disconnections
  print(f"socket disconnected with {ctx.req.connection_id}")
  return ctx


@socket.on("message")
async def on_message(ctx: WebsocketContext):
  # Publish to the topic with the connection id and the prompt.
  await publishable_chat_topic.publish({
    "connection_id": ctx.req.connection_id,
    "prompt": ctx.req.data.decode("utf-8"),
  })
  return ctx


Nitric.run()
We'll then create our subscriber which will respond to the publish requests.
import os

from common.model_parameters import ModelParameters
from common.resources import chat_topic, socket

from nitric.context import MessageContext
from nitric.application import Nitric
from llama_index.core import StorageContext, load_index_from_storage, Settings


@chat_topic.subscribe()
async def query_model(ctx: MessageContext) -> MessageContext:
  params = ModelParameters()

  prompt = ctx.req.data.get("prompt")

  Settings.llm = params.llm
  Settings.embed_model = params.embed_model

  # Get the model from the stored local context
  if os.path.exists(ModelParameters.persist_dir):
    print("Loading model from storage...")
    storage_context = StorageContext.from_defaults(persist_dir=params.persist_dir)
    index = load_index_from_storage(storage_context)
  else:
    print("model does not exist")
    ctx.res.success = False
    return ctx

  # Get the query engine from the index, and use the prompt template for sanitisation.
  query_engine = index.as_query_engine(
    streaming=False,
    similarity_top_k=4,
    text_qa_template=params.prompt_template,
  )

  print(f"Querying model: \"{prompt}\"")

  # Query the model
  query_resp = query_engine.query(prompt)

  print(f"Response: {query_resp}")

  # Send a response back to the socket
  await socket.send(
    ctx.req.data.get("connection_id"),
    query_resp.response.encode("utf-8"),
  )

  return ctx


Nitric.run()
Test it locally
Now that our application is complete, we can test it locally. You can do this using nitric start and connecting to the websocket through either the Nitric Dashboard or another websocket client. Once connected, you can send a message with a prompt to the model. Sending a prompt like "What is Nitric?" should produce output similar to:
Nitric is a cloud-agnostic framework designed to aid developers in building full cloud applications, including infrastructure.
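If you'd rather test from a terminal than the dashboard, a small client script works too. This is a sketch that assumes the Python websockets package is installed; the URL is a placeholder for the local websocket address shown in the Nitric dashboard.

import asyncio

import websockets


async def main():
  # Placeholder: use the websocket address shown in the Nitric local dashboard.
  url = "ws://localhost:4001"

  async with websockets.connect(url) as ws:
    await ws.send("What is Nitric?")
    # The subscriber sends the model's answer back over the same connection.
    print(await ws.recv())


asyncio.run(main())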
Get ready for deployment
Now that it's tested locally, we can get our project ready for containerization. The default Python dockerfile uses python3.11-bookworm-slim as its base container image, which doesn't have the dependencies required to load the Llama model. So, we'll create a new Dockerfile, model.dockerfile, based on the default python.dockerfile but using python3.11-bookworm (the non-slim version) instead. We'll keep the default dockerfile for our chat service and use the new one for the subscriber service.
Update line 2:
-FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
+FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder
And line 17:
-FROM python:3.11-slim-bookworm
+FROM python:3.11-bookworm
We'll also change the model.dockerfile to download the extra ml dependencies.
RUN --mount=type=cache,target=/root/.cache/uv \
-uv sync --frozen --no-install-project --no-dev --no-python-downloads
+uv sync --extra ml --frozen --no-install-project --no-dev --no-python-downloads

COPY . /app

RUN --mount=type=cache,target=/root/.cache/uv \
-uv sync --frozen --no-dev --no-python-downloads
+uv sync --extra ml --frozen --no-dev --no-python-downloads
To ensure an optimised Docker image, update the python.dockerfile.dockerignore to include the models folder.
.mypy_cache/
.nitric/
.venv/
nitric-spec.json
nitric.yaml
README.md
models/
We can then update the nitric.yaml file to point each service to the correct dockerfile.
name: llama-rag
services:
  - match: services/chat.py
    runtime: python
    start: uv run watchmedo auto-restart -p *.py --no-restart-on-command-exit -R uv run $SERVICE_PATH
  - match: services/subscriber.py
    runtime: model
    start: uv run watchmedo auto-restart -p *.py --no-restart-on-command-exit -R uv run $SERVICE_PATH
runtimes:
  python:
    dockerfile: ./python.dockerfile
  model:
    dockerfile: ./model.dockerfile
Deploy the project
When you're ready to deploy the project, create a new Nitric stack file that targets AWS:
nitric stack new dev aws
Update the stack file nitric.dev.yaml with the appropriate AWS region and memory allocation to handle the model:
provider: nitric/aws@1.14.0
region: us-east-1
config:
  # How services will be deployed by default, if you have other services not running models
  # you can add them here too so they don't use the same configuration
  default:
    lambda:
      # Set the memory to 6GB to handle the model, this automatically sets additional CPU allocation
      memory: 6144
      # Set a timeout of 900 seconds (maximum for a lambda)
      timeout: 900
      # We add more storage to the lambda function, so it can store the model
      ephemeral-storage: 1024
We can then deploy using the following command:
nitric up
To test on AWS, we'll need to use a websocket client or the AWS console. You can verify the deployment the same way as locally: connect to the websocket and send a message containing a prompt for the model.
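For example, with the wscat CLI (the endpoint below is a placeholder; use the websocket URL printed by nitric up):

# Connect to the deployed websocket, then type a prompt and press enter
wscat -c wss://example.execute-api.us-east-1.amazonaws.com
> What is Nitric?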
Once you're finished querying the model, you can destroy the deployment using nitric down.
Summary
In this project we've successfully augmented an LLM using Retrieval Augmented Generation (RAG) with Llama Index and Nitric. You can modify this project to use any LLM, tweak the prompt template for more specific responses, or swap in context documents that suit your own requirements. We could also extend this project to maintain context between requests using a key value store, for more of a chat-like experience with the model.
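As a rough sketch of that extension (the store name, permissions, and history format here are assumptions rather than part of the guide's code), the subscriber could read and write conversation history with a key value store like this:

from nitric.resources import kv

# Key value store holding conversation history per websocket connection
chat_history = kv("chat-history").allow("get", "set")


async def load_history(connection_id: str) -> list:
  # Fetch stored exchanges for this connection, falling back to empty on the first message.
  try:
    stored = await chat_history.get(connection_id)
    return stored.get("messages", [])
  except Exception:
    return []


async def save_history(connection_id: str, prompt: str, answer: str):
  # Append the latest exchange so it can be folded into future prompts.
  messages = await load_history(connection_id)
  messages.append({"prompt": prompt, "answer": answer})
  await chat_history.set(connection_id, {"messages": messages})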