Using LangExtract and Elasticsearch


By Jeffrey Rengifo, Elastic

Get hands-on with Elasticsearch: dive into our sample notebooks, start a free cloud trial, or try Elastic on your local machine right now.


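LangExtract is an open-source Python library created by Google that turns unstructured text into structured information using multiple LLMs and custom instructions. Unlike using an LLM on its own, LangExtract produces structured, traceable output: it links every extraction back to the original text and provides visualization tools for verification, which makes it a practical way to extract information in different contexts.

LangExtract is useful when you want to convert unstructured data (contracts, invoices, ledgers, and so on) into a defined structure that can be searched and filtered. For example, you can classify the expenses on an invoice, extract the parties to a contract, or even detect the sentiment of the characters in a passage of a book.

LangExtract also offers long-context processing, remote file loading, multiple passes to improve recall, and multiple workers to parallelize the work.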

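Use case

To demonstrate how LangExtract and Elasticsearch work together, we will use a dataset of 10 contracts of different types. The contracts contain standard data such as costs, amounts, dates, durations, and contracting parties. We will use LangExtract to extract structured data from the contracts and store it as fields in Elasticsearch, so that we can run queries and filters against it.

You can find the complete notebook here.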

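Steps

  1. Install dependencies and import packages

  2. Set up Elasticsearch

  3. Extract data with LangExtract

  4. Query the data

Install dependencies and import packages

We need to install LangExtract to process the contracts and extract structured data from them, along with the elasticsearch client to handle Elasticsearch requests.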

%pip install langextract elasticsearch -q

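With the dependencies installed, let's import the following:

  • json: for handling JSON data

  • os: for accessing local environment variables

  • glob: for finding files in a directory by pattern

  • google.colab: useful in Google Colab notebooks for loading locally stored files

  • helpers: provides extra Elasticsearch utilities, such as bulk inserting or updating multiple documents

  • IPython.display.HTML: lets you render HTML content directly in the notebook, which makes the output easier to read

  • getpass: for entering sensitive information, such as passwords or API keys, without echoing it to the screen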

import langextract as lx
import json
import os
import glob

from google.colab import files
from elasticsearch import Elasticsearch, helpers
from IPython.display import HTML
from getpass import getpass

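Set up Elasticsearch

Set up keys

Before developing the application, we need to set a few variables. We will use Gemini AI as our model. You can learn how to get an API key from Google AI Studio here. Also make sure you have an Elasticsearch API key available.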

os.environ["ELASTICSEARCH_API_KEY"] = getpass("Enter your Elasticsearch API key: ")
os.environ["ELASTICSEARCH_URL"] = getpass("Enter your Elasticsearch URL: ")
os.environ["LANGEXTRACT_API_KEY"] = getpass("Enter your LangExtract API key: ")

INDEX_NAME = "contracts"

Elasticsearch client

es_client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"], api_key=os.environ["ELASTICSEARCH_API_KEY"]
)

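Index mappings

Let's define the Elasticsearch mappings for the fields we are going to extract with LangExtract. Note that we use keyword for fields we only want to filter on, and text + keyword for fields we want to both search and filter on.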

try:
    mapping = {
        "mappings": {
            "properties": {
                "contract_date": {"type": "date", "format": "MM/dd/yyyy"},
                "end_contract_date": {"type": "date", "format": "MM/dd/yyyy"},
                "service_provider": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                },
                "client": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
                "service_type": {"type": "keyword"},
                "payment_amount": {"type": "float"},
                "delivery_time_days": {"type": "integer"},
                "governing_law": {"type": "keyword"},
                "raw_contract": {"type": "text"},
            }
        }
    }

    es_client.indices.create(index=INDEX_NAME, body=mapping)
    print(f"Index {INDEX_NAME} created successfully")
except Exception as e:
    print(f"Error creating index: {e}")
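To make the keyword versus text + keyword distinction concrete, here is a minimal sketch of the query shapes each mapping supports. The helper names `party_query` and `law_filter` are hypothetical and not part of the original notebook; the functions only build query bodies and do not contact a cluster.

```python
def party_query(field, value, exact=False):
    """Build a query body for a text field that also has a .keyword sub-field."""
    if exact:
        # term targets the un-analyzed keyword sub-field: whole-value match
        return {"query": {"term": {f"{field}.keyword": value}}}
    # match targets the analyzed text field: token-based full-text search
    return {"query": {"match": {field: value}}}


def law_filter(state):
    """Build a filter on governing_law, which is mapped as keyword only."""
    return {"query": {"bool": {"filter": [{"term": {"governing_law": state}}]}}}


print(party_query("client", "jenkins"))
print(party_query("client", "Robert Jenkins", exact=True))
print(law_filter("California"))
```

Any of these bodies could be passed to `es_client.search(index=INDEX_NAME, body=...)` once the index contains data.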

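Extract data with LangExtract

Providing examples

The LangExtract code defines a few-shot example that shows the model how to extract specific information from a contract.

The contract_examples variable contains an ExampleData object that includes:

  • Example text: a sample contract with typical information such as dates, parties, services, and payments

  • Expected extractions: a list of extraction objects that map each piece of information in the text to a specific class (extraction_class) and its normalized value (extraction_text). The extraction_class becomes the field name, and the extraction_text becomes the value of that field

For example, the date "March 10, 2024" in the text is extracted as the class contract_date (the field name) with the normalized value "03/10/2024" (the field value). The model follows these patterns to extract similar information from new contracts.

contract_prompt_description gives additional context about what to extract and in what order, covering what the examples alone cannot express.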

contract_prompt_description = "Extract contract information including dates, parties (contractor and contractee), purpose/services, payment amounts, timelines, and governing law in the order they appear in the text."

# Define contract-specific example data to help the model understand what to extract
contract_examples = [
    lx.data.ExampleData(
        text="Service Agreement dated March 10, 2024, between ABC Corp (Service Provider) and John Doe (Client) for consulting services. Payment: $5,000. Delivery: 30 days. Contract ends June 10, 2024. Governed by California law.",
        extractions=[
            lx.data.Extraction(
                extraction_class="contract_date", extraction_text="03/10/2024"
            ),
            lx.data.Extraction(
                extraction_class="end_contract_date", extraction_text="06/10/2024"
            ),
            lx.data.Extraction(
                extraction_class="service_provider", extraction_text="ABC Corp"
            ),
            lx.data.Extraction(extraction_class="client", extraction_text="John Doe"),
            lx.data.Extraction(
                extraction_class="service_type", extraction_text="consulting services"
            ),
            lx.data.Extraction(
                extraction_class="payment_amount", extraction_text="5000"
            ),
            lx.data.Extraction(
                extraction_class="delivery_time_days", extraction_text="30"
            ),
            lx.data.Extraction(
                extraction_class="governing_law", extraction_text="California"
            ),
        ],
    )
]
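Because each extraction_class becomes a field name, a class that is missing from the index mapping (for instance because of a typo) would be picked up by Elasticsearch dynamic mapping instead of the type chosen above. The following is a small sketch of a consistency check; `unmapped_classes` is a hypothetical helper, and the two lists are typed out by hand here so the snippet runs on its own.

```python
def unmapped_classes(extraction_classes, mapped_fields):
    """Return the example classes that have no matching field in the mapping."""
    return sorted(set(extraction_classes) - set(mapped_fields))


# Classes used in the example above
example_classes = [
    "contract_date", "end_contract_date", "service_provider", "client",
    "service_type", "payment_amount", "delivery_time_days", "governing_law",
]
# Field names from the index mapping (raw_contract holds the full text)
mapped_fields = example_classes + ["raw_contract"]

print(unmapped_classes(example_classes, mapped_fields))        # []
print(unmapped_classes(["contract_dat"], mapped_fields))       # ['contract_dat']
```

In the notebook, the same lists could be derived from `contract_examples` (via each extraction's `extraction_class`) and from `mapping["mappings"]["properties"]`.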

Dataset

You can find the complete dataset here. Below is a sample contract:

This Contract Agreement ("Agreement") is made and entered into on February 2, 2025, by and between:
* Contractor: GreenLeaf Landscaping Co.

* Contractee: Robert Jenkins

Purpose: Garden maintenance and landscaping for private residence.
Terms and Conditions:
   1. The Contractor agrees to pay the Contractee the sum of $3,200 for the services.

   2. The Contractee agrees to provide landscaping and maintenance services for a period of 3 months.

   3. This Agreement shall terminate on May 2, 2025.

   4. This Agreement shall be governed by the laws of California.

   5. Both parties accept the conditions stated herein.

Signed:
GreenLeaf Landscaping Co.
Robert Jenkins

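Some data is stated explicitly in the document, while other values can be inferred and converted by the model. For example, dates are formatted as MM/dd/yyyy (the format declared in the index mapping), and durations given in months are converted to days.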
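In the tutorial the model performs these conversions itself. As an optional sanity check, the same normalization can be sketched in plain Python. The helper names are hypothetical, and the flat 30-day month is an assumption (it matches the 3 months to 90 days conversion in the sample output, but the model is not guaranteed to use it).

```python
from datetime import datetime

INDEX_DATE_FORMAT = "%m/%d/%Y"  # Python equivalent of the mapping's MM/dd/yyyy
DAYS_PER_MONTH = 30             # assumption: a flat 30-day month


def to_index_date(long_date):
    """Convert a date such as 'February 2, 2025' to MM/dd/yyyy.

    %B parses English month names under the default C locale.
    """
    return datetime.strptime(long_date, "%B %d, %Y").strftime(INDEX_DATE_FORMAT)


def months_to_days(months):
    """Convert a duration in months to days using the flat-month assumption."""
    return months * DAYS_PER_MONTH


def is_index_date(value):
    """Return True only if value is a valid, zero-padded MM/dd/yyyy date."""
    try:
        parsed = datetime.strptime(value, INDEX_DATE_FORMAT)
    except ValueError:
        return False
    # The round trip rejects un-padded forms such as 2/2/2025
    return parsed.strftime(INDEX_DATE_FORMAT) == value


print(to_index_date("February 2, 2025"))  # 02/02/2025
print(months_to_days(3))                  # 90
print(is_index_date("2025-02-02"))        # False
```

A check like `is_index_date` could be applied to extracted dates before indexing, so that a badly formatted value is caught early rather than rejected by the date mapping.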

Running the extraction

In a Colab notebook, you can load the files with:

files.upload()

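LangExtract extracts fields and values with the lx.extract function. It must be called once per contract, passing the content, the prompt, the examples, and the model ID.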

contract_files = glob.glob("*.txt")

print(f"Found {len(contract_files)} contract files:")

for i, file_path in enumerate(contract_files, 1):
    filename = os.path.basename(file_path)
    print(f"\t{i}. {filename}")

results = []

for file_path in contract_files:
    filename = os.path.basename(file_path)

    with open(file_path, "r", encoding="utf-8") as file:
        content = file.read()

        # Run the extraction
        contract_result = lx.extract(
            text_or_documents=content,
            prompt_description=contract_prompt_description,
            examples=contract_examples,
            model_id="gemini-2.5-flash",
        )

        results.append(contract_result)
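The contracts in this dataset are short, so the call above uses the defaults. For longer documents, the multi-pass, parallel-worker, and chunking features mentioned in the introduction are exposed as extra arguments to lx.extract. The sketch below assumes the parameter names `extraction_passes`, `max_workers`, and `max_char_buffer` as documented in the LangExtract README; check them against your installed version. The helper names and the values are illustrative only.

```python
def build_extract_kwargs(text, prompt, examples, passes=3, workers=10, chunk_chars=1000):
    """Assemble keyword arguments for lx.extract (values are illustrative)."""
    return {
        "text_or_documents": text,
        "prompt_description": prompt,
        "examples": examples,
        "model_id": "gemini-2.5-flash",
        "extraction_passes": passes,     # several passes to improve recall
        "max_workers": workers,          # chunks processed in parallel
        "max_char_buffer": chunk_chars,  # chunk size for long documents
    }


def run_extraction(kwargs):
    """Call LangExtract; imported lazily so the builder works without it installed."""
    import langextract as lx

    return lx.extract(**kwargs)


kwargs = build_extract_kwargs("contract text", "prompt", [])
print(sorted(kwargs))
```

More passes and workers mean more model calls, so these settings trade cost for recall and speed.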

To better understand the extraction process, we can save the extraction results to an NDJSON file and visualize them:

NDJSON_FILE = "extraction_results.ndjson"

# Save the results to a JSONL file
lx.io.save_annotated_documents(results, output_name=NDJSON_FILE, output_dir=".")

# Generate the visualization from the file
html_content = lx.visualize(NDJSON_FILE)

HTML(html_content.data)

lx.visualize(NDJSON_FILE) generates an HTML visualization with references into a single document, where you can see the exact lines the data was extracted from.
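Outside a notebook, the same visualization can be written to a file and opened in a browser. This is a minimal sketch: `save_visualization` is a hypothetical helper, and it assumes the value returned by lx.visualize is either an IPython HTML object (with a `.data` attribute, as used above) or a plain string.

```python
import tempfile
from pathlib import Path


def save_visualization(html_content, path="visualization.html"):
    """Write the visualization to disk and return the path."""
    # Notebook environments return an object with .data; otherwise treat it as text
    html_text = html_content.data if hasattr(html_content, "data") else str(html_content)
    Path(path).write_text(html_text, encoding="utf-8")
    return path


# Demo with a plain string standing in for the lx.visualize() result
demo_path = Path(tempfile.mkdtemp()) / "visualization.html"
save_visualization("<p>demo</p>", demo_path)
print(demo_path.read_text(encoding="utf-8"))
```

In the notebook this would be called as `save_visualization(html_content)` right after the lx.visualize call.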

The data extracted from one contract looks like this:

{
  "extractions": [
    {
      "extraction_class": "contract_date",
      "extraction_text": "02/02/2025",
      "char_interval": null,
      "alignment_status": null,
      "extraction_index": 1,
      "group_index": 0,
      "description": null,
      "attributes": {}
    },
    {
      "extraction_class": "service_provider",
      "extraction_text": "GreenLeaf Landscaping Co.",
      ...
    },
    {
      "extraction_class": "client",
      "extraction_text": "Robert Jenkins",
      ...
    },
    {
      "extraction_class": "service_type",
      "extraction_text": "Garden maintenance and landscaping for private residence",
      ...
    },
    {
      "extraction_class": "payment_amount",
      "extraction_text": "3200",
      ...
    },
    {
      "extraction_class": "delivery_time_days",
      "extraction_text": "90",
      ...
    },
    {
      "extraction_class": "end_contract_date",
      "extraction_text": "05/02/2025",
      ...
    },
    {
      "extraction_class": "governing_law",
      "extraction_text": "California",
      ...
    }
  ],
  "text": "This Contract Agreement (\"Agreement\") is made and entered into on February 2, 2025, by and between:\n* Contractor: GreenLeaf Landscaping Co.\n\n* Contractee: Robert Jenkins\n\nPurpose: Garden maintenance and landscaping for private residence.\nTerms and Conditions:\n   1. The Contractor agrees to pay the Contractee the sum of $3,200 for the services.\n\n   2. The Contractee agrees to provide landscaping and maintenance services for a period of 3 months.\n\n   3. This Agreement shall terminate on May 2, 2025.\n\n   4. This Agreement shall be governed by the laws of California.\n\n   5. Both parties accept the conditions stated herein.\n\nSigned:\nGreenLeaf Landscaping Co.\nRobert Jenkins",
  "document_id": "doc_5a65d010"
}

Based on this result, we will index the data into Elasticsearch and query it.
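Since the model decides which classes to emit, it can be worth checking that every document produced all the expected fields before indexing. The sketch below reads NDJSON lines in the shape shown above and reports what is missing; `missing_fields`, the in-memory sample, and the `doc_demo` ID are hypothetical.

```python
import io
import json

EXPECTED_FIELDS = {
    "contract_date", "end_contract_date", "service_provider", "client",
    "service_type", "payment_amount", "delivery_time_days", "governing_law",
}


def missing_fields(ndjson_lines, expected=EXPECTED_FIELDS):
    """Yield (document_id, sorted missing classes) for incomplete documents."""
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        found = {e["extraction_class"] for e in doc.get("extractions", [])}
        gaps = sorted(set(expected) - found)
        if gaps:
            yield doc.get("document_id"), gaps


# In-memory stand-in for the NDJSON file, with one incomplete document
sample = io.StringIO(
    json.dumps(
        {
            "document_id": "doc_demo",
            "extractions": [
                {"extraction_class": "client", "extraction_text": "Robert Jenkins"}
            ],
        }
    )
    + "\n"
)
report = list(missing_fields(sample, expected={"client", "governing_law"}))
print(report)  # [('doc_demo', ['governing_law'])]
```

With the real file, the call would be `missing_fields(open(NDJSON_FILE, encoding="utf-8"))`.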

Querying data

Indexing data into Elasticsearch

We use the _bulk API to ingest the data into the contracts index. We store each extraction_class result as a new field, with extraction_text as the value of that field.

def build_data(ndjson_file, index_name):
    with open(ndjson_file, "r") as f:
        for line in f:
            doc = json.loads(line)

            contract_doc = {}

            for extraction in doc["extractions"]:
                extraction_class = extraction["extraction_class"]
                extraction_text = extraction["extraction_text"]

                contract_doc[extraction_class] = extraction_text

            contract_doc["raw_contract"] = doc["text"]

            yield {"_index": index_name, "_source": contract_doc}

try:
    success, errors = helpers.bulk(es_client, build_data(NDJSON_FILE, INDEX_NAME))
    print(f"{success} documents indexed successfully")

    if errors:
        print("Errors during indexing:", errors)
except Exception as e:
    print(f"Error: {str(e)}")

The bulk call prints the number of documents indexed:

10 documents indexed successfully

With that, we can start writing queries.
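One optional refinement to the indexing step: every extraction_text is a string, so build_data sends values such as "3200" to numeric fields. Elasticsearch coerces numeric strings by default, which is why indexing succeeds, but _source keeps the original strings, and a value that cannot be parsed would make that document fail. A minimal sketch of casting values before indexing follows; `coerce_fields` and `NUMERIC_CASTS` are hypothetical names, not part of the original notebook.

```python
NUMERIC_CASTS = {"payment_amount": float, "delivery_time_days": int}


def coerce_fields(contract_doc, casts=NUMERIC_CASTS):
    """Return (typed_doc, rejected), where rejected holds values that failed to cast."""
    typed = dict(contract_doc)
    rejected = {}
    for field, cast in casts.items():
        if field not in typed:
            continue
        raw = str(typed[field]).replace(",", "").replace("$", "").strip()
        try:
            typed[field] = cast(raw)
        except ValueError:
            # Keep the bad value out of the document and report it instead
            rejected[field] = typed.pop(field)
    return typed, rejected


doc, bad = coerce_fields(
    {"client": "Robert Jenkins", "payment_amount": "$3,200", "delivery_time_days": "3 months"}
)
print(doc)  # {'client': 'Robert Jenkins', 'payment_amount': 3200.0}
print(bad)  # {'delivery_time_days': '3 months'}
```

Inside build_data, this could be applied to contract_doc just before the yield, with the rejected values logged for review.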

Querying data

Now, let's query for contracts that have already expired and have a payment amount greater than or equal to 15,000.

try:
    response = es_client.search(
        index=INDEX_NAME,
        source_excludes=["raw_contract"],
        body={
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"payment_amount": {"gte": 15000}}},
                        {"range": {"end_contract_date": {"lte": "now"}}},
                    ]
                }
            }
        },
    )

    print(f"\nTotal hits: {response['hits']['total']['value']}")

    for hit in response["hits"]["hits"]:
        doc = hit["_source"]

        print(json.dumps(doc, indent=4))

except Exception as e:
    print(f"Error searching index: {str(e)}")

The results are:

{
    "contract_date": "01/08/2025",
    "service_provider": "MobileDev Innovations",
    "client": "Christopher Lee",
    "service_type": "Mobile application development for fitness tracking and personal training",
    "payment_amount": "18200",
    "delivery_time_days": "100",
    "end_contract_date": "04/18/2025",
    "governing_law": "Colorado"
},
{
    "contract_date": "01/22/2025",
    "service_provider": "BlueWave Marketing Agency",
    "client": "David Thompson",
    "service_type": "Social media marketing campaign and brand development for startup company",
    "payment_amount": "15600",
    "delivery_time_days": "120",
    "end_contract_date": "05/22/2025",
    "governing_law": "Florida"
},
{
    "contract_date": "02/28/2025",
    "service_provider": "CloudTech Solutions Inc.",
    "client": "Amanda Foster",
    "service_type": "Cloud infrastructure migration and setup for e-commerce platform",
    "payment_amount": "22400",
    "delivery_time_days": "75",
    "end_contract_date": "05/15/2025",
    "governing_law": "Washington"
}
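The same bool/filter pattern extends to the other extracted fields. As a sketch, the hypothetical helper `contracts_query` below combines an optional full-text match on raw_contract with exact filters on governing_law and payment_amount. It only builds the request body; the search terms and values are illustrative.

```python
def contracts_query(text=None, state=None, min_amount=None):
    """Build a search body mixing full-text search with structured filters."""
    filters = []
    if state is not None:
        # governing_law is a keyword field, so term gives an exact match
        filters.append({"term": {"governing_law": state}})
    if min_amount is not None:
        filters.append({"range": {"payment_amount": {"gte": min_amount}}})

    bool_query = {"filter": filters}
    if text:
        # raw_contract is a text field, so match runs an analyzed search
        bool_query["must"] = [{"match": {"raw_contract": text}}]
    return {"query": {"bool": bool_query}}


example = contracts_query(text="landscaping", state="California", min_amount=3000)
print(example)
```

The body can be passed to `es_client.search(index=INDEX_NAME, source_excludes=["raw_contract"], body=example)` in the same way as the query above.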

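Conclusion

LangExtract makes it easier to extract structured information from unstructured documents, with clear mappings and traceability back to the source text. Combined with Elasticsearch, that data can be indexed and queried, which enables filtering and searching on contract fields such as dates, payment amounts, and parties.

In our example we kept the dataset simple, but the same process can scale to larger document collections or to other domains such as legal, financial, or medical text. You can also experiment with more extraction examples, custom prompts, or additional post-processing to tune the results for your specific use case.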

 

Original article: https://www.elastic.co/search-labs/blog/langextract-elasticsearch-tutorial-usage-example

                                                                                </div>


