A Beginner's Guide to OpenAI Text Embedding Models

万能的大雄

Artificial Intelligence | 2024-09-17 08:46:37

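Overview: This article is a comprehensive guide to using OpenAI text embedding models for embedding creation and semantic search in generative AI applications.

Vector embeddings are essential in AI: they convert complex, unstructured data into numerical vectors that machines can process. These embeddings capture the semantic meaning and relationships in the data, enabling more effective analysis and content generation.

OpenAI, the company behind ChatGPT, offers several embedding models that produce high-quality vector representations for a range of applications, including semantic search, clustering, and anomaly detection. This guide explores how to use OpenAI's text embedding models to build intelligent, responsive AI systems.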

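What Are Vector Embeddings and Embedding Models?

Before diving in, let's define a few terms. First, what are vector embeddings? They are a cornerstone of many AI concepts. Vector embeddings are numerical representations of data, particularly unstructured data such as text, video, audio, images, and other digital media. They capture the semantic meaning and relationships in the data, and give storage systems and AI models an efficient way to understand, process, store, and retrieve complex, high-dimensional unstructured data.

So if embeddings are numerical representations of data, how is data converted into vector embeddings? That is the job of an embedding model.

An embedding model is a specialized algorithm that converts unstructured data into vector embeddings. It is designed to learn the patterns and relationships in the data and express them in a high-dimensional space. The key idea is that similar data will have similar vector representations and sit closer together in that space, which lets AI models process and analyze the data more effectively.

For example, in natural language processing (NLP), an embedding model might learn that the words "king" and "queen" are related and should sit close together in the vector space, while the word "banana" should be placed farther away.

This proximity in the vector space reflects the semantic relationships between the words.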
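
To make "closeness" concrete, here is a minimal sketch that measures it with cosine similarity, one common choice of metric. The three-dimensional vectors are hand-picked toy values chosen for illustration, not output from any real embedding model (real embeddings have hundreds or thousands of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means very similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example (NOT real model output).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
king_banana = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

print(f"king vs queen:  {king_queen:.3f}")
print(f"king vs banana: {king_banana:.3f}")
```

With these toy values, "king" scores much higher against "queen" than against "banana", which is the behavior a trained embedding model is expected to produce for related and unrelated words.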

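A common use of embedding models and vector embeddings is in retrieval-augmented generation (RAG) systems.

Rather than relying solely on the pretrained knowledge in a large language model (LLM), a RAG system supplies the LLM with additional contextual information before it generates output. This additional data is converted into vector embeddings by an embedding model and then stored in a vector database such as Milvus (also available as a fully managed service through Zilliz Cloud).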
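
The retrieval step of that flow can be sketched in a few lines. This is a simplified illustration rather than a production design: the in-memory `store` stands in for a vector database, the vectors are placeholder values rather than real embeddings, and the helper names `retrieve` and `build_prompt` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vector, store, top_k=1):
    """Return the texts of the top_k stored items most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]


def build_prompt(question, passages):
    """Put the retrieved passages in front of the question before it is sent to an LLM."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Stand-in for a vector database; vectors are placeholders, not real embeddings.
store = [
    {"text": "Artificial intelligence was founded as an academic discipline in 1956.",
     "vector": [0.9, 0.1, 0.0]},
    {"text": "Born in Maida Vale, London, Turing was raised in southern England.",
     "vector": [0.1, 0.9, 0.2]},
]

question = "Where was Alan Turing born?"
question_vector = [0.2, 0.8, 0.1]  # placeholder for the embedded question

passages = retrieve(question_vector, store, top_k=1)
prompt = build_prompt(question, passages)
print(prompt)
```

In a real system, the same embedding model would produce both the stored vectors and the question vector, and the resulting prompt would be passed to the LLM.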

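RAG is well suited to organizations and developers that need detailed, fact-based responses to queries, which makes it valuable across many business domains.

OpenAI Text Embedding Models

OpenAI offers several embedding models that are well suited to tasks such as semantic search, clustering, recommendation systems, anomaly detection, diversity measurement, and classification.

Given OpenAI's popularity, many developers are likely to try out RAG concepts with its models. Although these concepts apply to embedding models in general, let's focus on what OpenAI specifically offers.

For NLP, the following OpenAI embedding models are particularly relevant: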

text-embedding-ada-002
text-embedding-3-small
text-embedding-3-large

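The table below compares these models directly.

Model	Description	Output dimension	Max input (tokens)	Price
text-embedding-3-large	Embedding model suited to both English and non-English tasks	3,072	8,191	$0.13 / 1M tokens
text-embedding-3-small	Improved performance over the second-generation ada embedding model	1,536	8,191	$0.02 / 1M tokens
text-embedding-ada-002	Most capable second-generation embedding model, replacing 16 first-generation models	1,536	8,191	$0.10 / 1M tokens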
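
Because pricing is quoted per million tokens, a rough embedding budget is simple arithmetic. The sketch below uses the $0.13 per million tokens listed for text-embedding-3-large; the 5-million-token corpus size is a made-up figure for illustration, and `embedding_cost_usd` is a hypothetical helper name.

```python
def embedding_cost_usd(token_count, price_per_million_usd):
    """Estimated cost of embedding token_count tokens at a per-million-token price."""
    return token_count / 1_000_000 * price_per_million_usd


corpus_tokens = 5_000_000  # hypothetical corpus size
cost = embedding_cost_usd(corpus_tokens, price_per_million_usd=0.13)
print(f"Estimated cost: ${cost:.2f}")
```

Prices change over time, so check OpenAI's current pricing page before budgeting.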

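Choosing the Right Model

As with everything, choosing a model involves trade-offs.

Before committing to one of these models, make sure you clearly understand what you want to do, what resources you have available, and how accurate you expect the generated output to be.

With a RAG system, you will likely need to balance compute resources against the speed and accuracy of query responses.

text-embedding-3-large: This is probably the model of choice when accuracy and embedding richness are critical. It uses the most CPU and memory resources (that is, it is the most expensive) and takes the longest to produce output, but the output will be high quality. Typical use cases include research, high-stakes applications, or processing very complex text.
text-embedding-3-small: If you care more about speed and efficiency than about achieving the absolute best results, this model uses fewer resources, which lowers cost and shortens response time. Typical use cases include real-time applications or situations with limited resources.
text-embedding-ada-002: While the other two models are the latest versions, this was OpenAI's leading model before they were introduced. This versatile model offers a good middle ground between the two extremes, delivering solid performance at reasonable efficiency.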
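
One part of this trade-off is easy to quantify: larger embeddings take more space to store and search. The sketch below estimates the raw vector size for the 3,072-dimension and 1,536-dimension outputs from the table, assuming each component is stored as a 32-bit float and ignoring index structures and metadata. The one-million-vector corpus is a hypothetical figure.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is a 32-bit float


def raw_vector_bytes(dimension, vector_count):
    """Raw size of the vectors alone, ignoring index overhead and metadata."""
    return dimension * BYTES_PER_FLOAT32 * vector_count


vector_count = 1_000_000  # hypothetical corpus size
for name, dimension in [
    ("text-embedding-3-large", 3072),
    ("text-embedding-ada-002", 1536),
]:
    gib = raw_vector_bytes(dimension, vector_count) / 2**30
    print(f"{name}: {gib:.2f} GiB for {vector_count:,} vectors")
```

Doubling the dimension doubles the raw storage, so the richer representation of the larger model comes with a proportional memory cost in the vector database.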

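How to Generate Vector Embeddings with OpenAI

Let's walk through how to generate vector embeddings with each of these embedding models. Whichever model you choose, you will need a few things to get started, including a vector database.

PyMilvus, the Python software development kit (SDK) for Milvus, is convenient here because it integrates seamlessly with all of these OpenAI models. The OpenAI Python library, the SDK provided by OpenAI, is another option.

GitHub: https://github.com/milvus-io/pymilvus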
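
For comparison, this is roughly what calling the OpenAI Python library directly looks like. It is a minimal sketch assuming version 1.x of the `openai` package; `embed_texts` is a hypothetical helper name, and the client is passed in as an argument so the function can be exercised without network access.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input text, in input order.

    `client` is expected to be an `openai.OpenAI` instance (openai>=1.0).
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage (needs `pip install openai` and the OPENAI_API_KEY environment variable):
#
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   vectors = embed_texts(client, ["Where was Alan Turing born?"])
#   print(len(vectors[0]))  # length of the embedding vector
```

Going this route means you handle storage and search yourself, whereas the PyMilvus embedding function wraps the same API call and returns vectors ready to insert into a collection.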

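For this tutorial, however, we will use PyMilvus to generate vector embeddings and store them in Zilliz Cloud for a simple semantic search.

Getting started with Zilliz Cloud is straightforward:

Sign up for a free Zilliz Cloud account.
Set up a serverless cluster and get its public endpoint and API key.
Create a vector collection and insert your vector embeddings.
Run a semantic search against the stored embeddings.

Now let's look at how to generate vector embeddings with each of the three models discussed above.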

text-embedding-ada-002

Generate vector embeddings with text-embedding-ada-002 and store them in Zilliz Cloud for semantic search:

# Requires the embedding extras: pip install "pymilvus[model]"
from pymilvus.model.dense import OpenAIEmbeddingFunction
from pymilvus import MilvusClient

OPENAI_API_KEY = "your-openai-api-key"
# Placeholders (not defined in the original snippet): use your own cluster's values
ZILLIZ_PUBLIC_ENDPOINT = "your-zilliz-public-endpoint"
ZILLIZ_API_KEY = "your-zilliz-api-key"

ef = OpenAIEmbeddingFunction("text-embedding-ada-002", api_key=OPENAI_API_KEY)

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
# Generate embeddings for documents
docs_embeddings = ef(docs)

queries = [
    "When was artificial intelligence founded",
    "Where was Alan Turing born?",
]
# Generate embeddings for queries
query_embeddings = ef(queries)

# Connect to Zilliz Cloud with Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

COLLECTION = "documents"
# Start from a clean collection sized to the model's output dimension
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True,
)

# Insert each document together with its embedding
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run a semantic search for both queries and return the stored text
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"],
)

text-embedding-3-small

Generate vector embeddings with text-embedding-3-small and store them in Zilliz Cloud for semantic search:

# Requires the embedding extras: pip install "pymilvus[model]"
from pymilvus import model, MilvusClient

OPENAI_API_KEY = "your-openai-api-key"
# Placeholders (not defined in the original snippet): use your own cluster's values
ZILLIZ_PUBLIC_ENDPOINT = "your-zilliz-public-endpoint"
ZILLIZ_API_KEY = "your-zilliz-api-key"

ef = model.dense.OpenAIEmbeddingFunction(
    model_name="text-embedding-3-small",
    api_key=OPENAI_API_KEY,
)

# Generate embeddings for documents
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
docs_embeddings = ef.encode_documents(docs)

# Generate embeddings for queries
queries = [
    "When was artificial intelligence founded",
    "Where was Alan Turing born?",
]
query_embeddings = ef.encode_queries(queries)

# Connect to Zilliz Cloud with Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

COLLECTION = "documents"
# Start from a clean collection sized to the model's output dimension
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True,
)

# Insert each document together with its embedding
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run a semantic search for both queries and return the stored text
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"],
)

text-embedding-3-large

Generate vector embeddings with text-embedding-3-large and store them in Zilliz Cloud for semantic search:

# Requires the embedding extras: pip install "pymilvus[model]"
from pymilvus.model.dense import OpenAIEmbeddingFunction
from pymilvus import MilvusClient

OPENAI_API_KEY = "your-openai-api-key"
# Placeholders (not defined in the original snippet): use your own cluster's values
ZILLIZ_PUBLIC_ENDPOINT = "your-zilliz-public-endpoint"
ZILLIZ_API_KEY = "your-zilliz-api-key"

ef = OpenAIEmbeddingFunction("text-embedding-3-large", api_key=OPENAI_API_KEY)

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
# Generate embeddings for documents
docs_embeddings = ef(docs)

queries = [
    "When was artificial intelligence founded",
    "Where was Alan Turing born?",
]
# Generate embeddings for queries
query_embeddings = ef(queries)

# Connect to Zilliz Cloud with Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

COLLECTION = "documents"
# Start from a clean collection sized to the model's output dimension
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True,
)

# Insert each document together with its embedding
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run a semantic search for both queries and return the stored text
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"],
)
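
None of the snippets above show how to read `results`. The sketch below assumes the shape that `MilvusClient.search` typically returns: one list of hits per query, where each hit exposes `distance` and an `entity` holding the requested output fields. The sample data is invented to illustrate that shape (the distances are not real search scores), and `summarize_hits` is a hypothetical helper name.

```python
def summarize_hits(queries, results):
    """Pair each query with its hits and format them as printable lines."""
    lines = []
    for query, hits in zip(queries, results):
        lines.append(f"Query: {query}")
        for hit in hits:
            lines.append(f"  {hit['distance']:.3f}  {hit['entity']['text']}")
    return lines


# Invented sample mimicking the assumed result shape (distances are made up).
sample_queries = ["Where was Alan Turing born?"]
sample_results = [[
    {"id": 1, "distance": 0.82,
     "entity": {"text": "Born in Maida Vale, London, Turing was raised in southern England."}},
    {"id": 2, "distance": 0.41,
     "entity": {"text": "Alan Turing was the first person to conduct substantial research in AI."}},
]]

for line in summarize_hits(sample_queries, sample_results):
    print(line)
```

With a live cluster, the same helper could be called as `summarize_hits(queries, results)` using the variables from the snippets above. How to interpret `distance` depends on the collection's metric type, so check which metric your collection uses before comparing scores.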