Finally, we’re ready to discuss some more interesting content. This time, we will briefly introduce RAG and its related technologies and concepts. I will also share my understanding of the relationship between RAG technology and “machine memory.” Building on that discussion, we will do a simple practical exercise: using AI-controlled RAG tags for simple auxiliary retrieval from permanent memory.

Introduction

The LLM is a concept we mention frequently in this series, and one that has evolved rapidly of late. The recent DeepSeek R1 series, for example, has made a significant contribution to LLM history thanks to its low inference cost, high output quality, and open-source release; in contrast, a certain CloseAI clearly shows the disparity. OpenAI, do take more care… However, despite how powerful modern LLMs are, they retain one flaw I consider fatal: the model is static and difficult to update in real time. Techniques such as fine-tuning, LoRA, and RLHF can alter an LLM’s output tendencies, but they do not fundamentally update its knowledge. Unless… we redo pre-training almost from scratch (which yields the best results), or resort to “continual learning” or “partial pre-training” (with moderate results), along with all the accompanying data cleaning, sorting, and refining. As things stand, rapidly iterating a reasonably usable large model is unrealistic for individuals or average enterprises. Ultimately, these models were never designed for real-time information updates in the first place; I suspect this is because introducing a time dimension would turn the prediction function into a very complex non-linear beast. (I promise that’s not me parroting Richard S. Sutton’s viewpoint.) All in all, low-cost online-learning LLM technology may not yet be mature. If we want a relatively simple way to give an LLM real-time knowledge, then, to my rudimentary understanding, the likely route is to reserve a partition inside the prompt and use an external database or real-time retrieval to inject fresh information into that partition, yielding more accurate knowledge updates.

This idea sounds straightforward and crude, but implementing it in real-world engineering poses many challenges. For instance, how do we achieve a relatively optimal solution balancing performance, cost, and real-time efficiency? How do we dynamically adjust retrieval/writing strategies based on different scenarios? What optimizations can we develop regarding memory specialization? We will touch on these issues lightly today.

Some Concepts

Word Embedding

In natural language processing, word embedding represents a word in a numeric format. This embedding is used for text analysis. Generally, this representation is a real-valued vector that encodes the meaning of the words in such a way that words closer together in the vector space are expected to be similar in meaning. Word embeddings can be obtained through language modeling and feature learning techniques that map words or phrases in the vocabulary to real-valued vectors.

Is this hard to understand? That’s because the excerpt above is straight from Wikipedia 😅, so let me explain in simpler language:

Word embedding is a technology that enables computers to “understand” the meanings of words. Its core idea is to convert each word into a numeric vector, usually with very many dimensions (a simplified example: [0.3, -1.2, 4.5]), which can be interpreted as the word’s coordinates in a “semantic space.” For instance, the vectors for “apple” and “orange” are close together (since they are both fruits), while the vectors for “apple” and “car” are farther apart. This representation allows computers to capture the semantic relationships between words through vector calculations.

What Does Word Embedding Do?

Traditional AI, when processing text, can only see isolated characters or word frequencies (e.g., counting how many times the term “memory” appears in this article) without understanding the relationships between words. Word embedding resolves this issue in a few ways:

  1. Semantic Similarity: Mapping semantically similar words to adjacent positions in vector space. For example, the vectors for “hospital” and “clinic” are close.
  2. Semantic Relationship Modeling: Supporting analogy reasoning, such as “king - man + woman ≈ queen.” This ability is crucial for complex tasks (like memory systems). “Genshin Impact - benefits + plot cannot skip + toxicity ≈ relying on defense”
  3. Semantic Clustering: Groups or filters can be formed from several words (mapped as points in high-dimensional space), much like points on a 2D plane can be grouped into clusters. This helps improve retrieval speed and accuracy/semantic understanding (it isn’t strictly a classification method; think of it this way just for ease of understanding). A toy sketch of this vector-space geometry follows this list.
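To make the “closer in vector space means closer in meaning” idea concrete, here is a toy sketch using made-up 4-dimensional vectors and plain cosine similarity; real embedding models produce hundreds or thousands of dimensions, so treat the numbers as purely illustrative.

import numpy as np

# Made-up 4-dimensional "embeddings"; real models use far more dimensions.
vectors = {
    "apple":  np.array([0.9, 0.8, 0.1, 0.0]),
    "orange": np.array([0.85, 0.75, 0.15, 0.05]),
    "car":    np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means "pointing the same way" in semantic space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["apple"], vectors["orange"]))  # high: both fruits
print(cosine(vectors["apple"], vectors["car"]))     # low: unrelated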

Behavior in RAG

  1. Question Vectorization: When a user asks, “How can I relieve cold symptoms?” RAG first converts the question into an embedding vector (sometimes after splitting it into simpler phrases).
  2. Semantic Retrieval: The system searches for text segments in the document pool closest to the question vector (e.g., “home care tips for cold: rest, stay hydrated…”), rather than relying solely on simple keyword matching.
  3. Augmented Generation: The generation model (e.g., GPT) produces the answer based on the retrieved content, keeping the information accurate and relevant. (A minimal end-to-end sketch follows this list.)
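Here is a minimal, hypothetical sketch of those three steps; the stub functions stand in for a real embedding model, vector database, and LLM, so only the data flow matters.

from typing import List

def embed(text: str) -> List[float]:
    # Stub: a real system would call an embedding model here.
    return [float(len(text))]

def vector_search(query_vec: List[float], top_k: int = 3) -> List[str]:
    # Stub: a real system would query a vector database here.
    return ["home care tips for cold: rest, stay hydrated..."][:top_k]

def llm_generate(prompt: str) -> str:
    # Stub: a real system would call an LLM here.
    return "(the generator answers here, grounded in the retrieved text)"

def answer(question: str) -> str:
    query_vec = embed(question)                   # 1. question vectorization
    passages = vector_search(query_vec, top_k=3)  # 2. semantic retrieval
    prompt = "Answer using this context:\n" + "\n".join(passages) + "\nQ: " + question
    return llm_generate(prompt)                   # 3. augmented generation

print(answer("How can I relieve cold symptoms?"))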

Additional Supplement…

Early embedding models like Word2Vec and GloVe are static (each word has a single vector), whereas modern models generate context-dependent dynamic vectors (e.g., “apple” gets different vectors in “eating apples” and “Apple smartphone”). Contemporary RAG systems usually employ sentence or paragraph embeddings (like OpenAI’s text-embedding-3-small), but the underlying principle is the same as word embedding. Embedding is, at its core, a quantification and encoding technique, and it is not limited to text: we can extract features from images, videos, and audio and embed those too, which connects to multimodal RAG. Relationships between different pieces of information can also be embedded, like nested dolls where layers of relationships are refined one inside another; that relates to Graph RAG. In summary, there are many ways to approach this, but one constant is that machines have limits when processing information: any system without self-correcting mechanisms can, at best, approach the error rate of the human-provided training data. This constraint should influence the design philosophy of memory systems.

What is in RAG?

RAG Diagram

(By Turtlecrown - Own work, CC BY-SA 4.0)

Note that this is a fairly typical RAG system diagram; the actual implementation is quite flexible, but generally consists of three modules: Retriever, Generator, and Injector.

Retriever

The retriever is responsible for quickly fetching files/text segments related to the input question from external databases (like vector databases and their associated document stores). It may employ sparse retrieval (traditional term-matching algorithms) or dense retrieval (semantic matching based on the embeddings we just discussed). Either way, the retriever must balance recall and retrieval speed; even though the subsequent generator is usually the slower component, we still aim to optimize wherever possible.

Let’s understand the retriever simply: it finds all content that looks related to the question, which may include irrelevant information that could hurt the final answer. The retriever doesn’t concern itself with that; following the RAG philosophy that too much information is better than too little, it retrieves everything it can, and the subsequent generator sorts through whatever comes back. Even so, there is plenty of room to optimize the retriever, since dynamically choosing retrieval modes for different task types can yield significant improvements. For example, a popular hybrid-retrieval approach runs BM25, fuzzy search, and DPR concurrently, though that feature used to be a paid add-on in several open-source RAG projects; I’m not sure of its current pricing status.
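As a rough illustration of hybrid retrieval, here is a sketch that fuses a sparse BM25 score with a dense cosine score (the fuzzy-search leg is omitted, and the dense “embedding” below is a random placeholder rather than a real model, so only the score-fusion mechanics matter):

import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = [
    "home care tips for cold: rest, stay hydrated",
    "how to change a car tire safely",
    "cold symptoms usually include a sore throat and runny nose",
]

def embed(text: str) -> np.ndarray:
    # Placeholder dense vector; a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

bm25 = BM25Okapi([d.split() for d in docs])    # sparse side
doc_vecs = np.stack([embed(d) for d in docs])  # dense side

def hybrid_search(query: str, alpha: float = 0.5):
    sparse = np.array(bm25.get_scores(query.split()))
    q = embed(query)
    dense = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    # Normalize each score list to [0, 1] before mixing so neither side dominates.
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    combined = alpha * norm(sparse) + (1 - alpha) * norm(dense)
    return [docs[i] for i in np.argsort(-combined)]

print(hybrid_search("how can I relieve cold symptoms?"))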

Generator

The generator is responsible for comprehensively analyzing the retrieved context, the input question, and the contextual background to produce the final answer. Simply put, it consolidates all available information and determines which information genuinely addresses the question, selectively retaining relevant content to produce the best response.

There is optimization potential here as well. We can run several different LLMs concurrently and weight their outputs with an attention-like mechanism, or apply more personalized weighting controls, such as a forgetting curve, tags, or environmental awareness, to obtain more precise responses.
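To make the forgetting-curve idea concrete, here is a hypothetical re-ranking step the generator could apply before reading its context: each retrieved memory carries a similarity score and a timestamp, and older memories are decayed exponentially. The field names and the 30-day half-life are my own illustrative choices, not part of any particular library.

import math
import time

HALF_LIFE_DAYS = 30.0  # illustrative: a memory loses half its weight after a month

def recency_weight(created_at: float, now: float) -> float:
    # Exponential forgetting curve: 1.0 for a brand-new memory, decaying with age.
    age_days = (now - created_at) / 86400.0
    return math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

def rerank(memories, now=None):
    # Each memory is a dict like {"text": ..., "similarity": 0.83, "created_at": 1717000000.0}.
    now = now or time.time()
    return sorted(
        memories,
        key=lambda m: m["similarity"] * recency_weight(m["created_at"], now),
        reverse=True,
    )

print(rerank([
    {"text": "old but very similar", "similarity": 0.9, "created_at": time.time() - 90 * 86400},
    {"text": "recent and fairly similar", "similarity": 0.7, "created_at": time.time() - 86400},
]))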

Other Auxiliary/Optimization Components

Using a non-relational database for auxiliary indexing to improve performance and reduce vector-store costs; Agent RAG, a relatively smarter (and more costly) flavor of RAG that uses LLM-assisted decision-making to steer retrieval; integrating online search or real-time decisions based on real-world sensor data… and so on and so forth…

Add more water to the flour; add more flour to the water

Vector Database

A vector database is a system specifically designed to handle high-dimensional vector data. It acts like an intelligent library that not only stores vast amounts of unstructured data (such as text, images, audio) but also quickly finds the most relevant content through “semantic understanding.” Unlike traditional databases that rely on keyword matching, the core of a vector database is the transformation of data into vectors (a group of numbers forming a “fingerprint”) and measuring their correlation by calculating distances between those vectors (like cosine similarity). This ties back to the embedding techniques discussed previously! The vectors generated by embeddings can be stored in a vector database for subsequent indexing/querying/updating.

Milvus

Milvus is an open-source vector database that is powerful, highly performant, and well documented. Through its distributed architecture and efficient indexing algorithms (like HNSW and IVF), it achieves millisecond-level retrieval over billions of vectors. This capability makes vector databases a foundational support for RAG systems, particularly in scenarios requiring real-time responses and dynamic memory updates: the database serves both as a “memory hub” and an “information filter.” Milvus’s indexing and retrieval functionality is robust enough to act as a retriever on its own, and it also pairs well with other retrieval systems!

There are also some lightweight local vector databases or free online services, such as Pinecone or Chroma, which won’t be elaborated on here; the principles are quite similar. Please refer to their respective documentation for usage guidance. This tutorial will take Milvus as the example vector database.

There are a few concepts in vector databases that we need to understand:

  • Collection: A container for storing vectors and metadata (similar to a table in a relational database). Its schema can be dynamic, but without some categorization a collection easily becomes disorganized.
  • Index: A data structure that accelerates vector retrieval (like IVF_FLAT, HNSW, etc.). After storing a batch of vectors, the database will pre-generate an index, greatly improving retrieval speed.
  • Partition: A logical shard that optimizes large-scale data management. If your memory spans thousands of years, partitions might come in handy…
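Partitions and indexes are easier to grasp in code. The sketch below is based on my reading of the pymilvus MilvusClient API; the collection and partition names are hypothetical (we only create a real collection later in this article), so treat it as an illustration rather than a step to run right now.

from pymilvus import MilvusClient

client = MilvusClient(uri="http://127.0.0.1:19530", db_name="cyberai_mem")

# Partition: a logical shard inside a collection (names here are hypothetical).
client.create_partition(collection_name="demo_collection", partition_name="year_2024")

# Index: build an HNSW index on the vector field to speed up retrieval.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",                      # default vector field name
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)
client.create_index(collection_name="demo_collection", index_params=index_params)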

Let’s Dive In!

Having written so much, I’m almost falling asleep… Let’s get to some practical work! We’ll set up a very simple RAG to retrieve conversational summaries from our previous article!

Deploy Milvus Vector Database

The tried-and-true method is to deploy using Docker Compose for easy management and backup/restoration. First, create a folder for storing Milvus-related Compose files:

mkdir milvus
cd milvus

Then create a docker-compose.yaml file:

vim docker-compose.yaml

Configuration file:

version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    
  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.4.6
    command: ["milvus", "run", "standalone"]
    security_opt:
    - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
      - ./milvus.yaml:/milvus/configs/milvus.yaml
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

networks:
  default:
    name: milvus
    driver: bridge

(Yes, it looks complicated…)

Having a completely independent compose file aids in individual maintenance, like for automatic database backups/restorations and clustering.

Now, save and exit.

Configuring Milvus

Configuring Milvus requires a milvus.yaml file. The official Milvus documentation provides a template for the configuration file, which we can download using this link/command:

wget https://raw.githubusercontent.com/milvus-io/milvus/v2.5.4/configs/milvus.yaml

Please make sure the version tag in the URL matches your Milvus image and deployment method; the compose file above uses milvusdb/milvus:v2.4.6, so adjust the tag accordingly.

You can opt to add authentication and more configuration items; this tutorial won’t delve into those details further; please refer to the official documentation: Official Documentation.

For this tutorial, we will use the default configuration. You can later enhance this with a reverse proxy, set authentication, and implement automatic backups.

Let’s try to start it up for the first time:

docker-compose up -d

You might not want to use -d on the first try so you can watch for issues, although the logs scroll by quickly and can be hard to follow… If everything looks fine, we will access the vector database remotely using the Milvus Python SDK.
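The compose file above already exposes Milvus’s health endpoint on port 9091, so one way to confirm the standalone service is up is a quick probe (a minimal sketch, assuming the requests package is installed):

import requests

# Probe the health endpoint declared in the compose file's healthcheck.
resp = requests.get("http://127.0.0.1:9091/healthz", timeout=5)
print(resp.status_code, resp.text)  # expect 200 once Milvus standalone is ready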

Simple Database Operations via Python SDK

Initialization and Creation

First, let’s install the SDK:

pip install -U pymilvus

Now, we’ll create a database named cyberai_mem.

from pymilvus import connections, db
conn = connections.connect(host="127.0.0.1", port=19530)  # Replace with your actual Milvus server address and port
db.create_database("cyberai_mem")

You can also write it this way:

conn = connections.connect(
    host="127.0.0.1",
    port="19530",
    db_name="cyberai_mem"
)
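As a quick sanity check (a minimal sketch using the same pymilvus db module), you can list the databases on the server and confirm that cyberai_mem shows up:

from pymilvus import connections, db

connections.connect(host="127.0.0.1", port="19530")
print(db.list_database())  # expect something like ['default', 'cyberai_mem']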

Of course, if creating the database or collection is cumbersome using scripts or commands, we can use a GUI client for simple database operations: Attu.

Using this GUI, you can perform some basic database operations.

Vector Search, Youngster

Connect to the Database

Let’s connect to the database!

from pymilvus import MilvusClient
client = MilvusClient(
    uri="http://127.0.0.1:19530",  # Replace with your actual Milvus server address and port
    db_name="cyberai_mem",
)

Create a Collection

Now, let’s create a collection! Specify the name of the collection and the dimension of the vectors:

if client.has_collection(collection_name="demo_collection"):
    client.drop_collection(collection_name="demo_collection")
client.create_collection(
    collection_name="demo_collection",
    dimension=1536,  # The vectors we will use in this demo have 1536 dimensions, matching your embedding model's output dimension
)

Note that this uses default collection settings, including vector indexing methods and templates. A more advanced tutorial later on will cover dynamic Schema and custom templates… (laying the groundwork)

In the default settings, the primary key and vector fields use their default names (“id” and “vector”), and the metric type (vector distance definition) is set to its default value (COSINE).
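If you don’t want those defaults, the field names and the metric can be overridden at creation time. The snippet below is a sketch based on my reading of the pymilvus quick-setup parameters; treat the exact keyword names as assumptions to verify against your pymilvus version.

# Hypothetical variation on the quick setup above (verify kwargs against your pymilvus version).
client.create_collection(
    collection_name="demo_collection_custom",
    dimension=1536,
    metric_type="IP",             # inner product instead of the default COSINE
    primary_field_name="id",      # explicit primary-key field name
    vector_field_name="vector",   # explicit vector field name
    auto_id=True,                 # let Milvus generate primary keys
)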

Prepare Data and Vectorize

Unfinished business ……