Apache Doris AI 能力揭秘(三):AI_AGG 与 EMBED 函数深度解析


                                                                                                                                                <p>在初步探索了 AI 函数的可能性之后,本次我们将目光投向两个更为核心的函数:<code>AI_AGGEMBED。我们将深入解析这两个函数的设计理念、实现原理及其在业务场景中的应用,展示 Apache Doris 如何通过原生的函数设计,将文本聚合与语义向量分析无缝集成到 SQL 中,为用户提供更强大、更易用的智能数据分析体验。

相关阅读:

AI_AGG:基于 AI 的文本聚合

聚合是数据分析中最常见的操作,但如果聚合的对象是海量的用户评论、支持工单或日志文本,传统的聚合函数难以直接处理这类非结构化文本数据。为此,Doris 支持了 AI_AGG,一个能够调用 AI 对文本进行聚合的函数。它让分析师可以根据自定义指令,从大量的文本中处理特定的任务。

01 使用方法及示例

AI_AGG 详细用法请参考: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/aggregate-functions/ai-agg

示例 1:

下表模拟一个简易的客服工单:

SELECT * FROM support_tickets;
+-----------+---------------+------------------+----------------------------------------------------------------------------------+
| ticket_id | customer_name | subject          | details                                                                          |
+-----------+---------------+------------------+----------------------------------------------------------------------------------+
|         2 | Bob           | Login Failure    | Same problem as Alice. Also seeing 502 errors on the SSO page.                   |
|         3 | Carol         | Payment Declined | Credit card charged twice but order still shows pending.                         |
|         5 | Eve           | Login Failure    | Getting redirected back to login after entering 2FA code.                        |
|         1 | Alice         | Login Failure    | Cannot log in after password reset. Tried clearing cache and different browsers. |
|         4 | Dave          | Slow Dashboard   | Dashboard takes >30 seconds to load since the last release.                      |
+-----------+---------------+------------------+----------------------------------------------------------------------------------+

我们可以通过 AI_AGG 对不同问题类型下客户的问题进行总结

SELECT
    subject,
    AI_AGG(
        details,
        'Summarize every ticket detail into one short paragraph'
    ) AS ai_summary
FROM support_tickets
GROUP BY subject;

输出示例如下:

+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subject          | ai_summary                                                                                                                                                              |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Slow Dashboard   | The dashboard is experiencing slow loading times, taking over 30 seconds to load following the most recent release.                                                     |
| Payment Declined | A customer reports being charged twice for their order, which remains in a pending status.                                                                              |
| Login Failure    | Users are experiencing login issues, including 2FA redirection, post-password reset failures, and SSO 502 errors, despite clearing cache and trying different browsers. |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

示例 2:

下表简易模拟了应用反馈表:

SELECT * FROM app_feedback;
+-------------+----------------+-----------+--------------------------------------------------------------------------------------+
| feedback_id | feature_module | user_name | feedback_text                                                                        |
+-------------+----------------+-----------+--------------------------------------------------------------------------------------+
|           1 | 机票预订       | 李雷      | 搜索航班非常快,价格也透明,筛选功能很好用。                                         |
|           2 | 机票预订       | 韩梅梅    | 支付后出票速度有点慢,等了快半小时,希望能改进。                                     |
|           3 | 机票预订       | 小陈      | 希望能增加更多廉价航空公司的选项。                                                   |
|           4 | 机票预订       | 马丽      | App在选择日期的时候偶尔会卡顿,体验不是很好。                                        |
|           5 | 酒店预订       | 王先生    | 酒店信息很全,图片也真实。但是取消预订的流程太复杂了。                               |
|           6 | 酒店预订       | 刘女士    | 通过App预订比其他平台便宜,还有会员折扣,非常满意。                                  |
|           7 | 行程规划       | 赵四      | 行程规划功能太棒了,可以自动推荐路线和景点,省了不少心。                             |
|           8 | 客户服务       | 孙小小    | 联系客服很方便,问题解决得也很快,给客服点赞。                                       |
|           9 | 客户服务       | 钱多多    | 电话客服总是占线,在线客服回复又很慢,希望能增加人手。                               |
+-------------+----------------+-----------+--------------------------------------------------------------------------------------+

我们可以通过 AI_AGG 对用户反馈进行总结:

SELECT
    feature_module AS '功能模块',
    AI_AGG(
        feedback_text,
        '请用精确且尽可能简短地总结这些用户反馈中提到的主要问题和建议'
    ) AS 'AI反馈总结'
FROM app_feedback
GROUP BY feature_module;

输出示例如下

+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 功能模块     | AI反馈总结                                                                                                                                                                                                     |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 客户服务     | 主要问题:客服电话占线,在线客服回复慢。  
主要建议:增加客服人手。                                                                                                                                                |
| 行程规划     | 主要问题:无  
主要建议:赞赏行程规划功能,建议继续优化                                                                                                                                                        |
| 酒店预订     | 主要问题:取消预订流程复杂。
主要建议:无明确建议,但肯定App预订价格优势和会员折扣。                                                                                                                          |
| 机票预订     | 主要问题:支付后出票慢(约半小时),App日期选择卡顿,廉价航空公司选项不足。
主要建议:优化出票速度,修复App卡顿问题,增加廉价航空公司选项。                                                                   |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

02 AI_AGG 的实现原理

将聚合函数与 AI 结合,需要解决一个分组内文本总量可能远超模型上下文窗口的问题。如果将所有文本拼接后一次性发送给 AI, 极易出现拼接文本大于模型的最大上下文窗口的情况,而 Doris 中通过动态预聚合解决了这个问题:

02 AI_AGG 的实现原理.png

  1. 上下文监控:在聚合拼接文本的过程中,AI_AGG 会为每一个分组维护一个内部的文本缓冲区(目前大小固定为 128K,为绝大部分 AI 可兼容的上下文窗口)
  2. 动态预聚合:当新的文本行将要被加入缓冲区并会导致其大小超过阈值时,AI_AGG 会触发一次预聚合。它会暂停接收新数据,将当前缓冲区内的所有文本作为一个批次发送给 AI 进行一次中间任务处理。
  3. 上下文替换:AI 返回的中间处理结果,其长度通常远小于原始文本。AI_AGG 会用这个精炼后的摘要替换掉缓冲区内原有的长文本,从而为处理更多的新文本行腾出空间。为保证聚合过程的稳定性和防止超出模型服务上限,在替换原有文本后,若加入当前文本行还是会导致缓冲区超过阈值,AI_AGG 会直接报错退出。

通过这种实现方式,AI_AGG 可以完全融入 Doris 的分布式查询计划,利用多节点并行计算,并由数据库自动管理聚合过程中的中间状态。因此,用户可以用熟悉的 SQL 聚合语法,在海量文本上实现高效的智能分析。

EMBED: 文本向量化函数

EMBED 函数的核心功能是通过 AI 将任意文本转换为高维度的浮点数向量。这个向量是文本在语义空间中的一种数学表示,它捕捉了文本的语义信息。语义相近的文本,其向量在空间中的距离也更近。

EMBED 详细用法请参考: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/ai-functions/distance-functions/embed

01 使用方法及示例

示例 1:

下表模拟简易的行为手册

CREATE TABLE knowledge_base (
    id BIGINT,
    title STRING,
    content STRING,
    embedding ARRAY<FLOAT> COMMENT 'Semantic vector generated by EMBED function'
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES (
    "replication_num" = "1"
);

INSERT INTO knowledge_base (id, title, content, embedding) VALUES
(1, "Travel Reimbursement Policy",
    "Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.",
    EMBED("travel reimbursement policy")),
(2, "Leave Policy",
    "Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required.",
    EMBED("leave request policy")),
(3, "VPN User Guide",
    "To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate.",
    EMBED("VPN guide intranet access")),
(4, "Meeting Room Reservation",
    "Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.",
    EMBED("meeting room booking reservation")),
(5, "Procurement Request Process",
    "Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.",
    EMBED("procurement request process finance"));

通过 EMBED 函数对文本的向量化操作,结合 Doris 支持的向量函数, 可对数据进行如下操作:

  1. 问答检索(结合 COSINE_DISTANCE
SELECT 
    id, title, content,
    COSINE_DISTANCE(embedding, EMBED("How to apply for travel reimbursement?")) AS score
FROM knowledge_base
ORDER BY score ASC
LIMIT 2;
+------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| id   | title                       | content                                                                                                                                 | score              |
+------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|    1 | Travel Reimbursement Policy | Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.        | 0.4463210454563673 |
|    5 | Procurement Request Process | Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. | 0.5726841578491431 |
+------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------+
  1. 问题分析匹配(结合 L2_DISTANCE
SELECT 
    id, title, content,
    L2_DISTANCE(embedding, EMBED("How to access the company intranet")) AS distance
FROM knowledge_base
ORDER BY distance ASC
LIMIT 2;
+------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| id   | title                       | content                                                                                                                                     | distance           |
+------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|    3 | VPN User Guide              | To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate. | 0.5838271122253775 |
|    1 | Travel Reimbursement Policy | Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.            |  1.272394695975331 |
+------+-----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
  1. 根据文章内容进行文本相关度匹配并推荐(结合INNER PRODUCT
SELECT 
    id, title, content,
    INNER_PRODUCT(embedding, EMBED("Leave system request leader approval")) AS score
FROM knowledge_base
WHERE id != 2
ORDER BY score DESC
LIMIT 2;
+------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| id   | title                       | content                                                                                                                                 | score               |
+------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------------+
|    5 | Procurement Request Process | Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. |    0.33268885332504 |
|    4 | Meeting Room Reservation    | Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.                         | 0.29224032230852487 |
+------+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+---------------------+
  1. 寻找差异较小的内容(结合L1_DISTANCE
SELECT 
    id, title, content,
    L1_DISTANCE(embedding, EMBED("Procurement application process")) AS distance
FROM knowledge_base
ORDER BY distance ASC
LIMIT 3;
+------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| id   | title                       | content                                                                                                                                        | distance           |
+------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|    5 | Procurement Request Process | Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.        |  18.66882028897362 |
|    4 | Meeting Room Reservation    | Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.                                |  30.90449328294426 |
|    2 | Leave Policy                | Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required. | 31.060405636536416 |
+------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+

EMBED 的设计不仅是一个文本向量化工具,更是 Doris 向量分析生态的重要组成部分。 通过与其他向量函数(COSINE_DISTANCEL2_DISTANCEINNER_PRODUCTL1_DISTANCE )的无缝集成,EMBED 支持用户在 SQL 查询中完成从文本到向量,再到相似度计算或检索的一站式分析。

示例 2:

下表模拟简易的客服常见问题文档:

CREATE TABLE support_docs (
    id BIGINT,
    title STRING,
    content STRING,
    embedding ARRAY<FLOAT> COMMENT '由 EMBED 函数生成的语义向量'
) DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ( "replication_num" = "1" );

INSERT INTO support_docs (id, title, content, embedding) VALUES
(1, "系统登录问题排查",
    "当用户无法登录时,请先检查网络连接、浏览器缓存,并确认账号未被锁定。",
    EMBED("登录问题排查 网络异常 账号锁定")),
(2, "数据备份与恢复指南",
    "客户可在管理后台手动备份数据,若误删可通过支持团队申请数据恢复。",
    EMBED("数据备份 恢复 操作指南")),
(3, "账单与发票问题",
    "企业客户可在财务模块下载电子发票。如需纸质版,请提交申请工单。",
    EMBED("账单 发票 财务模块")),
(4, "API 接口调用规范",
    "开发者在调用 API 时需携带正确的访问令牌,否则将返回身份验证错误。",
    EMBED("接口调用 认证错误 访问令牌")),
(5, "服务中断应急流程",
    "若出现服务中断,技术团队需在30分钟内启动应急响应并发布公告。",
    EMBED("服务中断 应急响应 处理流程"));

通过 EMBED 函数对文本的向量化操作,结合 Doris 支持的向量函数, 可对数据进行如下操作:

  1. 问答检索(结合 COSINE_DISTANCE
SELECT 
    id, title, content,
    COSINE_DISTANCE(embedding, EMBED("登录不上系统怎么办?")) AS score
FROM support_docs
ORDER BY score ASC
LIMIT 2;
+------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+
| id   | title                    | content                                                                                                | score     |
+------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+
|    1 | 系统登录问题排查         | 当用户无法登录时,请先检查网络连接、浏览器缓存,并确认账号未被锁定。                                   | 0.3183002 |
|    4 | API 接口调用规范         | 开发者在调用 API 时需携带正确的访问令牌,否则将返回身份验证错误。                                      | 0.5599254 |
+------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+
  1. 问题分析匹配(结合 L2_DISTANCE
SELECT 
    id, title, content,
    L2_DISTANCE(embedding, EMBED("接口调用时提示没有权限")) AS distance
FROM support_docs
ORDER BY distance ASC
LIMIT 2;
+------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+
| id   | title                    | content                                                                                                | distance  |
+------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+
|    4 | API 接口调用规范         | 开发者在调用 API 时需携带正确的访问令牌,否则将返回身份验证错误。                                      | 0.6471552 |
|    1 | 系统登录问题排查         | 当用户无法登录时,请先检查网络连接、浏览器缓存,并确认账号未被锁定。                                   | 0.9831962 |
+------+--------------------------+--------------------------------------------------------------------------------------------------------+-----------+
  1. 根据文章内容进行文本相关度匹配并推荐(结合INNER PRODUCT
SELECT 
    id, title, content,
    INNER_PRODUCT(embedding, EMBED("账单 发票 报销流程")) AS score
FROM support_docs
ORDER BY score DESC
LIMIT 2;
+------+--------------------------+-----------------------------------------------------------------------------------------------+-----------+
| id   | title                    | content                                                                                       | score     |
+------+--------------------------+-----------------------------------------------------------------------------------------------+-----------+
|    3 | 账单与发票问题           | 企业客户可在财务模块下载电子发票。如需纸质版,请提交申请工单。                                | 0.8098868 |
|    5 | 服务中断应急流程         | 若出现服务中断,技术团队需在30分钟内启动应急响应并发布公告。                                  | 0.3729638 |
+------+--------------------------+-----------------------------------------------------------------------------------------------+-----------+
  1. 寻找差异较小的内容(结合L1_DISTANCE
SELECT 
    id, title, content,
    L1_DISTANCE(embedding, EMBED("服务中断 处理 指南")) AS distance
FROM support_docs
ORDER BY distance ASC
LIMIT 3;
+------+-----------------------------+--------------------------------------------------------------------------------------------------------+----------+
| id   | title                       | content                                                                                                | distance |
+------+-----------------------------+--------------------------------------------------------------------------------------------------------+----------+
|    5 | 服务中断应急流程            | 若出现服务中断,技术团队需在30分钟内启动应急响应并发布公告。                                           | 13.94832 |
|    2 | 数据备份与恢复指南          | 客户可在管理后台手动备份数据,若误删可通过支持团队申请数据恢复。                                       | 24.65827 |
|    1 | 系统登录问题排查            | 当用户无法登录时,请先检查网络连接、浏览器缓存,并确认账号未被锁定。                                   |  24.9747 |
+------+-----------------------------+--------------------------------------------------------------------------------------------------------+----------+

02 灵活的向量维度控制

通过 Doris 内置的 RESOURCE 机制,用户在配置 AI Resource 时,可以设置 ai.dimensions 参数来精确指定生成向量的维度。用户可以根据具体的业务场景和性能考量,选择生成高维向量以保留更丰富的语义信息,或选择低维向量以节约存储空间和加速计算。这种灵活性使得 EMBED 函数能够更好地适应从轻量级语义匹配到高精度向量检索等多样化的分析需求,让用户在成本与效果之间找到最佳平衡。

注意:在使用 dimensions 参数时,请务必确认 RESOURCE 中配置的模型支持您所指定的维度,否则可能导致请求错误。此外,Doris 内部对部分不支持维度定制的模型(例如 OpenAI 的 text-embedding-ada-002)做了限制。对于这些模型,即使在 RESOURCE 中设置了 dimensions 参数,该设置也将被忽略,函数将返回模型默认的维度。

总结与展望

借助AI_AGGEMBED 函数,Apache Doris 拥有了了强大的智能分析能力,极大地拓展了数据分析与智能应用的边界。AI_AGG 通过其动态预聚合机制,将非结构化文本的智能聚合分析带入数据库,轻松应对海量用户评论、日志分析等场景。而 EMBED 函数则与 Doris 的向量函数无缝集成,提供从文本到语义向量、再到相似度检索的一站式解决方案,极大简化了问答系统、内容推荐等应用的开发。这些功能使得 SQL 语言本身具备了驾驭 AI 模型的能力,让每一位数据分析师都能以低成本、高效率的方式,挖掘数据中更深层次的语义价值。

展望未来,Doris 将继续深化 AI 与数据库的融合。我们将致力于优化模型调度与计算性能,并探索更多如多模态数据分析、AI Agent 交互等前沿功能,持续降低 AI 技术的使用门槛,让数据驱动的智能决策无处不在。

                                                                                </div>



Source link

未经允许不得转载:紫竹林-程序员中文网 » Apache Doris AI 能力揭秘(三):AI_AGG 与 EMBED 函数深度解析

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
关于我们 免责申明 意见反馈 隐私政策
程序员中文网:公益在线网站,帮助学习者快速成长!
关注微信 技术交流
推荐文章
每天精选资源文章推送
推荐文章
随时随地碎片化学习
推荐文章
发现有趣的