Yangyehan&UndGround.

Some Thoughts on Getting Formatted Output from LLMs

Word count: 3.2k · Reading time: 17 min
2024/02/03

Output Parsers

Let's walk through a use case for output parsers. Suppose we have a set of product reviews and want to process them so the data is easier to analyze. Given a user's product review as input, we want the model to return JSON with the following fields:

gift: whether the item was bought as a gift for someone else, a bool;
delivery_days: how many days delivery took, or -1 if not mentioned;
price_value: any sentences about price or value, returned as a list;

We expect the LLM to return output like this (JSON):

```json
{
  "gift": false,
  "delivery_days": 5,
  "price_value": ["pretty affordable!"]
}
```

Let's first look at the output parser solution LangChain provides; at the end of this post, we'll use the same underlying principle to build a JSON output parser of our own.

```shell
!pip install langchain
!pip install openai
!pip install langchain_openai
```
```python
from langchain.llms import OpenAI
# from langchain.chat_models import ChatOpenAI  # deprecated import path
from langchain_openai import ChatOpenAI
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser
```
```python
import os

# Route traffic through a local proxy (adjust to your environment).
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
```
```python
customer_review = """\
This leaf blower is pretty amazing. It has four settings:\
candle blower, gentle breeze, windy city, and tornado. \
It arrived in two days, just in time for my wife's \
anniversary present. \
I think my wife liked it so much she was speechless. \
So far I've been the only one using it, and I've been \
using it every other morning to clear the leaves on our lawn. \
It's slightly more expensive than the other leaf blowers \
out there, but I think it's worth it for the extra features.
"""

review_template = """\
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? \
Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product \
to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,\
and output them as a comma separated Python list.

Format the output as JSON with the following keys:
gift
delivery_days
price_value

text: {text}
"""
```
```python
from langchain.prompts import ChatPromptTemplate

api_key = 'YOUR_OPENAI_API_KEY'  # never hard-code a real key in source
llm_model = 'gpt-3.5-turbo-0301'


# Build the ChatPromptTemplate
prompt_template = ChatPromptTemplate.from_template(review_template)
print(prompt_template)

messages = prompt_template.format_messages(text=customer_review)

# Create the LLM
chat = ChatOpenAI(temperature=0.0, model=llm_model, openai_api_key=api_key)

# Call the model
response = chat(messages)
print(type(response.content))
```

input_variables=['text'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['text'], template='For the following text, extract the following information:\n\ngift: Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.\n\ndelivery_days: How many days did it take for the product to arrive? If this information is not found, output -1.\n\nprice_value: Extract any sentences about the value or price,and output them as a comma separated Python list.\n\nFormat the output as JSON with the following keys:\ngift\ndelivery_days\nprice_value\n\ntext: {text}\n'))]
<class 'str'>
```python
response.content
print(response.content)
```
{
    "gift": true,
    "delivery_days": 2,
    "price_value": ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]
}
```python
response.content.get('gift')
```
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

Cell In[73], line 1
----> 1 response.content.get('gift')


AttributeError: 'str' object has no attribute 'get'
Error: we cannot operate on the returned response directly; at this point it is still just a string.
```python
import json

output_dict = json.loads(response.content)
print(output_dict)
print(output_dict['gift'])
```
{'gift': True, 'delivery_days': 2, 'price_value': ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]}
True

One thing strikes me as odd here: I tried using the json module to convert the string to a dict and then operate on it, and it actually worked. In that case, it seems we shouldn't need the {format_instructions} prompt introduced below at all.

LangChain, however, officially recommends adding a {format_instructions} prompt to get the LLM to emit JSON. My best guess is that with {format_instructions} in place, the LLM's output simply becomes more stable.

While writing this post I ran the example many times, and indeed, without the {format_instructions} prompt there are cases where the response cannot be parsed directly.
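The failure mode is easy to reproduce: without format instructions, the model sometimes emits Python-style literals (`True`/`False`/`None`) or wraps the JSON in a markdown fence, and `json.loads` rejects both. Here is a minimal sketch of a tolerant loader; the helper name `loads_llm_json` is ours, not LangChain's:

```python
import json
import re

def loads_llm_json(text: str) -> dict:
    """Best-effort parsing of JSON embedded in raw LLM output."""
    # Strip a surrounding ```json ... ``` fence if the model added one.
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if match:
        text = match.group(1)
    # Naively rewrite Python-style literals that some completions emit.
    # (This would also rewrite True/False inside string values; acceptable for a sketch.)
    for py, js in (("True", "true"), ("False", "false"), ("None", "null")):
        text = re.sub(rf"\b{py}\b", js, text)
    return json.loads(text)

print(loads_llm_json('```json\n{"gift": False, "delivery_days": 5}\n```'))
# → {'gift': False, 'delivery_days': 5}
```

This is only a fallback; as we'll see, prompting the model into a fixed format up front is the more reliable half of the solution.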

So let's walk through the official approach and its wrappers.

Parse the LLM output string into a Python dictionary

```python
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser
```
```python
gift_schema = ResponseSchema(
    name="gift",
    description="Was the item purchased as a gift for someone else? "
                "Answer True if yes, False if not or unknown.")
delivery_days_schema = ResponseSchema(
    name="delivery_days",
    description="How many days did it take for the product to arrive? "
                "If this information is not found, output -1.")
price_value_schema = ResponseSchema(
    name="price_value",
    description="Extract any sentences about the value or price, "
                "and output them as a comma separated Python list.")

response_schemas = [gift_schema,
                    delivery_days_schema,
                    price_value_schema]
```

In effect, LangChain wraps the format-instruction prompt in a small class: in ResponseSchema(name="...", description="..."), name is the JSON key and description is the annotation for that key.

```python
response_schemas
```

[ResponseSchema(name='gift', description='Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.', type='string'),
 ResponseSchema(name='delivery_days', description='How many days did it take for the product to arrive? If this information is not found, output -1.', type='string'),
 ResponseSchema(name='price_value', description='Extract any sentences about the value or price, and output them as a comma separated Python list.', type='string')]
```python
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
```
```python
format_instructions = output_parser.get_format_instructions()
```
```python
print(format_instructions)
```
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
"gift": string // Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.
"delivery_days": string // How many days did it take for the product to arrive? If this information is not found, output -1.
"price_value": string // Extract any sentences about the value or price, and output them as a comma separated Python list.
}
```
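Conceptually, `StructuredOutputParser.from_response_schemas` does little more than join each schema's name and description into the snippet above. A rough sketch of the idea (our own simplification, not LangChain's actual source):

```python
HEADER = ('The output should be a markdown code snippet formatted in the '
          'following schema, including the leading and trailing "```json" and "```":')

def build_format_instructions(schemas):
    """schemas: list of (name, description) pairs."""
    lines = [f'"{name}": string // {desc}' for name, desc in schemas]
    return HEADER + "\n\n```json\n{\n" + "\n".join(lines) + "\n}\n```"

print(build_format_instructions([
    ("gift", "Was the item purchased as a gift for someone else?"),
]))
```

The real parser also records a `type` per field and pairs the instructions with a matching `parse` method, but the generated prompt text is essentially this template.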
```python
review_template_2 = """\
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? \
Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product\
to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,\
and output them as a comma separated Python list.

text: {text}

{format_instructions}
"""

prompt = ChatPromptTemplate.from_template(template=review_template_2)

messages = prompt.format_messages(text=customer_review,
                                  format_instructions=format_instructions)
```
```python
print(messages[0].content)
```
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the productto arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,and output them as a comma separated Python list.

text: This leaf blower is pretty amazing.  It has four settings:candle blower, gentle breeze, windy city, and tornado. It arrived in two days, just in time for my wife's anniversary present. I think my wife liked it so much she was speechless. So far I've been the only one using it, and I've been using it every other morning to clear the leaves on our lawn. It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features.


The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
"gift": string // Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.
"delivery_days": string // How many days did it take for the product to arrive? If this information is not found, output -1.
"price_value": string // Extract any sentences about the value or price, and output them as a comma separated Python list.
}
```
```python
response = chat(messages)
```
```python
print(response.content)
```

```json
{
"gift": true,
"delivery_days": "2",
"price_value": ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]
}
```

Compared with the run without {format_instructions}, the LLM output drops a lot of stray characters such as \n.

```python
print(type(response.content))
```
<class 'str'>
```python
json_str = json.loads(response.content)
print(json_str)
```
{'gift': True, 'delivery_days': 2, 'price_value': ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]}

Here json.loads can still fail to convert the response directly (when the model follows the format instructions literally, the JSON is wrapped in a markdown code fence), but the official parse method handles this:

```python
output_dict = output_parser.parse(response.content)
```
```python
output_dict
```
{'gift': True,
 'delivery_days': 2,
 'price_value': ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]}

A closing thought: taken as a whole, the official LangChain approach is, under the hood, just prompt engineering. It uses instructions to get the LLM to standardize its output, then wraps prompt construction in a pile of ornate interfaces. Good abstractions are certainly valuable for prompts that get reused heavily, but over-wrapping adds complexity. In my view, a prompt's job is to put the LLM into a role and activate certain of its "abilities" (parameters), so prompt reuse mostly comes down to concatenating the user input with a previous prompt. So let's build a scheme for standardized LLM output using the most basic Python instead.

```python
# Import the LLM
from langchain_openai import ChatOpenAI

# For building a list of chat messages
from langchain.schema import HumanMessage


# Create the LLM
chat = ChatOpenAI(temperature=0.0, model=llm_model, openai_api_key=api_key)
```
````python
text = """\
This leaf blower is pretty amazing. It has four settings:\
candle blower, gentle breeze, windy city, and tornado. \
It arrived in two days, just in time for my wife's \
anniversary present. \
I think my wife liked it so much she was speechless. \
So far I've been the only one using it, and I've been \
using it every other morning to clear the leaves on our lawn. \
It's slightly more expensive than the other leaf blowers \
out there, but I think it's worth it for the extra features.
"""

format_instructions = """\
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
"gift": string // Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.
"delivery_days": string // How many days did it take for the product to arrive? If this information is not found, output -1.
"price_value": string // Extract any sentences about the value or price, and output them as a comma separated Python list.
}
```
"""

review_template = f"""
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else?
Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product
to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,
and output them as a comma separated Python list.

Format the output as JSON with the following keys:
gift
delivery_days
price_value

text: {text}

{format_instructions}
"""

print(review_template)
````

For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,
and output them as a comma separated Python list.

Format the output as JSON with the following keys:
gift
delivery_days
price_value

text: This leaf blower is pretty amazing. It has four settings:candle blower, gentle breeze, windy city, and tornado. It arrived in two days, just in time for my wife's anniversary present. I think my wife liked it so much she was speechless. So far I've been the only one using it, and I've been using it every other morning to clear the leaves on our lawn. It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features.


The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
"gift": string // Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.
"delivery_days": string // How many days did it take for the product to arrive? If this information is not found, output -1.
"price_value": string // Extract any sentences about the value or price, and output them as a comma separated Python list.
}
```

```python
# Build a list of chat messages
messages = [HumanMessage(content=review_template)]
print(messages[0].content)
```
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,and output them as a comma separated Python list.

Format the output as JSON with the following keys:
gift
delivery_days
price_value

text: This leaf blower is pretty amazing.  It has four settings:candle blower, gentle breeze, windy city, and tornado. It arrived in two days, just in time for my wife's anniversary present. I think my wife liked it so much she was speechless. So far I've been the only one using it, and I've been using it every other morning to clear the leaves on our lawn. It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features.


The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
"gift": string // Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.
"delivery_days": string // How many days did it take for the product to arrive? If this information is not found, output -1.
"price_value": string // Extract any sentences about the value or price, and output them as a comma separated Python list.
}
```

```python
response = chat(messages)
print(response.content)
```
```json
{
"gift": true,
"delivery_days": 2,
"price_value": ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]
}
```

Now let's write our own regex-based extractor:

```python
import re
import json


def parse_json(input_str: str):
    # Regex pattern matching content between a leading ```json and a trailing ```
    pattern = r'```json(.*?)```'

    # Search the input for a match
    match = re.search(pattern, input_str, re.DOTALL)

    if match:
        # Parse the captured JSON string
        json_str = json.loads(match.group(1))
        return json_str
    else:
        return None
```
```python
input_str = parse_json(response.content)
print(input_str)
print(type(input_str))
```
{'gift': True, 'delivery_days': 2, 'price_value': ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]}
<class 'dict'>
```python
input_str.get("price_value")
```
["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]

And we're done: with the prompt principle plus a regex-based extractor, we can implement an LLM output parser ourselves.
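The whole pipeline then fits in a few lines. The sketch below wires a template and the regex extractor together behind a single helper; `extract_review`, `build_prompt`, and the stub `fake_llm` are our own names, and any callable that maps a prompt string to a completion string (for example, a thin wrapper around the `chat` object above) can be dropped in:

```python
import json
import re

FORMAT_INSTRUCTIONS = ('Format the output as a markdown code snippet with the '
                       'leading and trailing "```json" and "```".')

def build_prompt(text: str) -> str:
    return (f"For the following text, extract gift, delivery_days and "
            f"price_value.\n\ntext: {text}\n\n{FORMAT_INSTRUCTIONS}")

def parse_json(input_str: str):
    # Same regex extractor as above: grab the body of a ```json ... ``` block.
    match = re.search(r'```json(.*?)```', input_str, re.DOTALL)
    return json.loads(match.group(1)) if match else None

def extract_review(text: str, llm) -> dict:
    """llm: any callable mapping a prompt string to a completion string."""
    return parse_json(llm(build_prompt(text)))

# A stub LLM exercises the plumbing without a network call.
def fake_llm(prompt: str) -> str:
    return '```json\n{"gift": true, "delivery_days": 2}\n```'

print(extract_review("It arrived in two days.", fake_llm))
# → {'gift': True, 'delivery_days': 2}
```

Injecting the model as a callable also makes the parser easy to unit-test, which the class-based LangChain wrappers only achieve with more machinery.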
