Vision-Agent Study Notes

agent CV


 2024/07/26 


First, here is what VisionAgent can do.

What vision_agent can do is determined by vision_agent.tools, so it is enough to check the tool list:

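blip_image_caption: generates a descriptive caption from the image content.
clip: classifies or tags an image and returns probability scores.
closest_box_distance: computes the closest distance between two bounding boxes.
closest_mask_distance: computes the closest distance between two masks.
extract_frames: extracts frames from a video and returns a list of frames with timestamps.
git_vqa_v2: answers a question about the image content, given the question and the image.
grounding_dino: detects and counts multiple objects from a text prompt, returning bounding boxes, labels, and probability scores.
grounding_sam: detects and segments multiple objects from a text prompt, returning bounding boxes, labels, masks, and probability scores.
load_image: loads an image from the given file path.
loca_visual_prompt_counting: counts the dominant foreground objects in an image, given a visual prompt (a bounding box).
loca_zero_shot_counting: counts the dominant foreground objects in an image with no additional input.
ocr: extracts text from an image, returning a list of detected text, bounding boxes, and confidence scores.
overlay_bounding_boxes: draws bounding boxes on an image.
overlay_heat_map: draws a heat map on an image.
overlay_segmentation_masks: draws segmentation masks on an image.
owl_v2: detects and counts multiple objects from a text prompt, returning bounding boxes, labels, and probability scores.
save_image: saves an image to the specified file path.
save_json: saves data as a JSON file, including data that contains NumPy arrays.
save_video: saves a list of frames as an MP4 video file.
vit_image_classification: classifies an image and returns the class and probability score.
vit_nsfw_classification: classifies an image as NSFW (not safe for work) or not, returning the predicted label and probability score.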
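
To make one of these tools concrete, here is a minimal sketch of what a "closest distance between two bounding boxes" computation could look like. It is an assumption about the idea, not the library's implementation: the function name box_gap_distance and the [x1, y1, x2, y2] box layout are chosen for illustration only.

```python
import math
from typing import Sequence


def box_gap_distance(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Smallest distance between two axis-aligned boxes given as [x1, y1, x2, y2].

    Hypothetical helper for illustration; not the vision_agent implementation.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Horizontal and vertical gaps; each is zero when the boxes overlap on that axis.
    dx = max(0.0, max(ax1, bx1) - min(ax2, bx2))
    dy = max(0.0, max(ay1, by1) - min(ay2, by2))
    return math.hypot(dx, dy)


print(box_gap_distance([0, 0, 1, 1], [4, 5, 6, 7]))  # 5.0
print(box_gap_distance([0, 0, 2, 2], [1, 1, 3, 3]))  # 0.0 (overlapping boxes)
```

The result is in whatever units the box coordinates use, so normalized boxes would give a normalized distance.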

Trying out VisionAgent

To get to know VisionAgent, start with a simple annotation task.

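import os
# Read the value of the OPENAI_API_KEY environment variable
# os.environ["OPENAI_API_KEY"] = "sk-..."  # set your own key here; never publish a real key
api_key = os.environ.get("OPENAI_API_KEY")
# Confirm the key is set without echoing the secret itself
print("OPENAI_API_KEY is set:", api_key is not None)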


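from vision_agent.agent import VisionAgent
agent = VisionAgent()
# The prompt is in Chinese: "Please annotate the drones in the image and measure the distance between them"
code = agent("请帮我标注出图中的无人机，并监测无人机之间的距离", media="/Users/mac/Documents/drone/vision-agent-main/photo.png")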

Print the results to take a look:

code['test_result'].results[0]

code['test_result'].results[1]

{'text/plain': "{'distances': [{'drone1': {'score': 0.12,\n    'label': 'drone',\n    'bbox': [0.46, 0.45, 0.48, 0.49]},\n   'drone2': {'score': 0.11,\n    'label': 'drone',\n    'bbox': [0.37, 0.49, 0.39, 0.53]},\n   'distance': 130.0}],\n 'annotated_image_path': '/Users/mac/Documents/drone/vision-agent-main/annotated_image.png'}"}
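
Note that the value under 'text/plain' is a string holding the printed form of a Python dict, not a dict itself. A minimal sketch of turning it back into structured data, assuming the string is a plain Python literal as it is in the output above (the result variable below is a trimmed, hand-copied version of that output):

```python
import ast

# Trimmed copy of the result shown above; the value under "text/plain" is a string.
result = {
    "text/plain": "{'distances': [{'drone1': {'score': 0.12, 'label': 'drone', "
    "'bbox': [0.46, 0.45, 0.48, 0.49]}, 'drone2': {'score': 0.11, 'label': 'drone', "
    "'bbox': [0.37, 0.49, 0.39, 0.53]}, 'distance': 130.0}]}"
}

# ast.literal_eval safely parses Python literals (dicts, lists, numbers, strings).
# json.loads would fail here because the text uses single quotes.
parsed = ast.literal_eval(result["text/plain"])
for pair in parsed["distances"]:
    print(pair["drone1"]["bbox"], pair["drone2"]["bbox"], pair["distance"])
```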

from vision_agent.tools import blip_image_caption
blip_image_caption("/Users/mac/Documents/drone/vision-agent-main/photo.png")

'a satellite image of the city of wuzhou'

### Image classification: clip(image, classes)
from vision_agent.tools import clip
from PIL import Image
import numpy as np
image_path = "/Users/mac/Documents/drone/vision-agent-main/photo.png"
# Load the image as a PIL object
image = Image.open(image_path)

# Convert the image to a NumPy array (optional, depending on what clip expects)
image_np = np.array(image)

classes = ["car", "drone", "map"]
result = clip(image_np, classes)

{'labels': ['car', 'drone', 'map'], 'scores': [0.0019, 0.0577, 0.9404]}
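
The labels and scores come back as two parallel lists, so picking the predicted class means pairing them up. A small sketch using the output above as literal data:

```python
# Output of clip() copied from above.
result = {"labels": ["car", "drone", "map"], "scores": [0.0019, 0.0577, 0.9404]}

# Pair each label with its score and keep the pair with the highest score.
best_label, best_score = max(
    zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
)
print(best_label, best_score)  # map 0.9404
```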

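Dissecting VisionAgent

First, let's see which agents are involved.

The core idea is to define a set of action tools up front, have multiple agents run in sequence to produce a result, reflect on that result, revise the plan, and arrive at the final answer.

The flow breaks down as follows.

Flows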

GetUserQuestion
PlanAgent:
- input: UserQuestion, tool_desc, feedback (optional)
- Uses chain-of-thought (CoT) to break the full task into subtasks
- Takes previous feedback into account (initially None)
- Looks up the available tools for each subtask
- return: JSON: instructions
ReflectAgent:
- Checks whether the code logic can answer the user's question
- input: code, plan
- return: JSON: feedback, success (bool)
CodeAgent:
- Algorithm/method selection, pseudocode creation, translating the pseudocode into executable Python code
- input: functions_docstring, subTask, feedback
- return: executable Python code
TestAgent:
- Writes test cases and checks the results
- Test approaches: 1. assert on the function's output; 2. call the function, then visualize the output and inspect it
- Can call E2B's CodeInterpreter to run the code in a sandboxed terminal and return the result
- input: function_docs, question, code
- return: test_code
FixBugAgent:
- Finds the error in the code and fixes it
- Can call a notebook to execute the code and return the result
- input: code / test_code
- return: JSON: reflections, code, test

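A note on tool selection: each subtask produced by PlanAgent is compared against the description of every tool in the library, and the tools with the highest similarity are retrieved.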
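
A rough sketch of similarity-based tool lookup follows. It deliberately swaps in a bag-of-words cosine similarity so that it runs with the standard library alone; a real system would more likely compare embedding vectors. The helper names and the three shortened tool descriptions are made up for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    # Lowercase word counts; a crude stand-in for an embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_tools(subtask: str, tool_descs: Dict[str, str], k: int = 1) -> List[str]:
    # Rank tool names by how similar their description is to the subtask.
    query = bag_of_words(subtask)
    ranked = sorted(
        tool_descs,
        key=lambda name: cosine(query, bag_of_words(tool_descs[name])),
        reverse=True,
    )
    return ranked[:k]


# Shortened, hypothetical descriptions for three of the tools.
tool_descs = {
    "ocr": "extract text from an image",
    "owl_v2": "detect and count objects in an image from a text prompt",
    "save_image": "save an image to a file path",
}
print(top_tools("detect the drones in the image", tool_descs))  # ['owl_v2']
```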

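This raises a question: how is the description document for each tool implemented?

Reading vision_agent.tools.prompts.py and the generate_classifier() method of class OpenAILMM shows that the tool's documentation is passed to the LMM, which returns the call parameters as JSON, in the form:
'Example 1: {{"Parameters":{{"keyword": "Artificial Intelligence", "language": "English"}}}}\n'
The source is included here.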

vision_agent.tools.prompts.py

# vision_agent.tools.prompts.py
SYSTEM_PROMPT = "You are a helpful assistant."

# EasyTool prompts
CHOOSE_PARAMS = (
    "This is an API tool documentation. Given a user's question, you need to output parameters according to the API tool documentation to successfully call the API to solve the user's question.\n"
    "This is the API tool documentation: {api_doc}\n"
    "Please note that: \n"
    "1. The Example in the API tool documentation can help you better understand the use of the API.\n"
    '2. Ensure the parameters you output are correct. The output must contain the required parameters, and can contain the optional parameters based on the question. If there are no parameters in the required parameters and optional parameters, just leave it as `{{"Parameters":{{}}}}`\n'
    "3. If the user's question mentions other APIs, you should ONLY consider the API tool documentation I give and do not consider other APIs.\n"
    '4. If you need to use this API multiple times, please set "Parameters" to a list.\n'
    "5. You must ONLY output in a parsible JSON format. Two example outputs look like:\n"
    "'''\n"
    'Example 1: `{{"Parameters":{{"keyword": "Artificial Intelligence", "language": "English"}}}}`\n'
    'Example 2: `{{"Parameters":[{{"keyword": "Artificial Intelligence", "language": "English"}}, {{"keyword": "Machine Learning", "language": "English"}}]}}`\n'
    "'''\n"
    "This is the user's question: {question}\n"
    "Output:\n"
)
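
The doubled braces in this prompt are there because the string is later passed through str.format: {{ and }} are escapes that produce literal braces, while {api_doc} and {question} are real placeholders. A shortened sketch (TEMPLATE is a cut-down stand-in for CHOOSE_PARAMS, not the full prompt):

```python
# Cut-down stand-in for CHOOSE_PARAMS, keeping one escaped JSON example.
TEMPLATE = (
    "API documentation: {api_doc}\n"
    'If no parameters are needed, output `{{"Parameters":{{}}}}`\n'
    "Question: {question}\n"
)

# Single-brace fields are substituted; doubled braces collapse to literal braces.
prompt = TEMPLATE.format(api_doc="clip(image, classes)", question="Is this a map?")
print(prompt)
```

After formatting, the model sees the JSON example with ordinary single braces.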

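vision_agent.tools.lmm.lmm.py

# vision_agent.tools.lmm.lmm.py
# class OpenAILMM: the generate_classifier() method
import json
import logging
from typing import Callable, cast

import vision_agent.tools as T
from vision_agent.tools.prompts import CHOOSE_PARAMS, SYSTEM_PROMPT

_LOGGER = logging.getLogger(__name__)


class OpenAILMM:  # excerpt; only generate_classifier is shown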

    def generate_classifier(self, question: str) -> Callable:
        api_doc = T.get_tool_documentation([T.clip])
        prompt = CHOOSE_PARAMS.format(api_doc=api_doc, question=question)
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
        )

        try:
            params = json.loads(cast(str, response.choices[0].message.content))[
                "Parameters"
            ]
        except json.JSONDecodeError:
            _LOGGER.error(
                f"Failed to decode response: {response.choices[0].message.content}"
            )
            raise ValueError("Failed to decode response")

        return lambda x: T.clip(x, params["prompt"])
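
The prompt allows "Parameters" to be either a single object or a list of objects, so code that consumes the reply has to handle both shapes. A minimal sketch of such parsing; parse_parameters is a hypothetical helper written for this note, not part of vision_agent:

```python
import json
from typing import Any, Dict, List


def parse_parameters(raw: str) -> List[Dict[str, Any]]:
    """Return the "Parameters" entry of a model reply as a list of dicts.

    Hypothetical helper for illustration; not part of vision_agent.
    """
    try:
        params = json.loads(raw)["Parameters"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing key, or a non-object reply all end up here.
        raise ValueError(f"Failed to decode response: {raw!r}") from exc
    # A single call is a dict; repeated calls arrive as a list of dicts.
    return params if isinstance(params, list) else [params]


single = parse_parameters('{"Parameters": {"keyword": "Artificial Intelligence"}}')
multiple = parse_parameters('{"Parameters": [{"keyword": "A"}, {"keyword": "B"}]}')
print(single)
print(multiple)
```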

Here get_tool_documentation joins the func_name, func_signature, and func_doc of each callable passed in into a single string; that string is api_doc.

The get_tool_documentation function

# A look at get_tool_documentation
import inspect
from typing import Any, Callable, List

def get_tool_documentation(funcs: List[Callable[..., Any]]) -> str:
    docstrings = ""
    for func in funcs:
        docstrings += f"{func.__name__}{inspect.signature(func)}:\n{func.__doc__}\n\n"

    return docstrings
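
To see what the resulting string looks like, the function can be run on a toy tool. The function body is repeated from above so the snippet is self-contained; count_items is a made-up tool used only for this demonstration.

```python
import inspect
from typing import Any, Callable, List


def get_tool_documentation(funcs: List[Callable[..., Any]]) -> str:
    # Same logic as above: name + signature, then the docstring, per function.
    docstrings = ""
    for func in funcs:
        docstrings += f"{func.__name__}{inspect.signature(func)}:\n{func.__doc__}\n\n"
    return docstrings


def count_items(image_path: str, label: str = "drone") -> int:
    """Hypothetical tool: count objects matching `label` in an image."""
    return 0


api_doc = get_tool_documentation([count_items])
print(api_doc)
# count_items(image_path: str, label: str = 'drone') -> int:
# Hypothetical tool: count objects matching `label` in an image.
```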

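The tools module vision_agent.tools.tools also has a group of functions that do the same job, mainly so that tools can be added dynamically.

# vision_agent.tools.tools has a group of functions that do the same job, mainly so that tools can be added dynamically
import inspect
from typing import Any, Callable, Dict, List

import pandas as pd
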
TOOLS = [
    owl_v2,
    grounding_sam,
    extract_frames,
    ocr,
    clip,
    vit_image_classification,
    vit_nsfw_classification,
    loca_zero_shot_counting,
    loca_visual_prompt_counting,
    git_vqa_v2,
    blip_image_caption,
    closest_mask_distance,
    closest_box_distance,
    save_json,
    load_image,
    save_image,
    save_video,
    overlay_bounding_boxes,
    overlay_segmentation_masks,
    overlay_heat_map,
]

def get_tools_df(funcs: List[Callable[..., Any]]) -> pd.DataFrame:
    data: Dict[str, List[str]] = {"desc": [], "doc": []}

    for func in funcs:
        # desc: the docstring text before "Parameters:", collapsed onto one line
        desc = func.__doc__
        if desc is None:
            desc = ""
        desc = desc[: desc.find("Parameters:")].replace("\n", " ").strip()
        desc = " ".join(desc.split())

        # doc: the name and signature followed by the full docstring
        doc = f"{func.__name__}{inspect.signature(func)}:\n{func.__doc__}"
        data["desc"].append(desc)
        data["doc"].append(doc)

    return pd.DataFrame(data)  # type: ignore


# Build the table only after get_tools_df has been defined
TOOLS_DF = get_tools_df(TOOLS)  # type: ignore
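
The description-extraction step can be tried in isolation, without pandas. The sketch below copies the slicing logic into a hypothetical short_description helper; it also shows one edge case: when a docstring has no "Parameters:" section, str.find returns -1 and the slice drops the final character.

```python
def short_description(docstring: str) -> str:
    # Keep the text before "Parameters:" and collapse all whitespace runs.
    head = docstring[: docstring.find("Parameters:")]
    return " ".join(head.split())


with_params = """Counts the objects in an image.

    Parameters:
        image: the image to inspect
    """
without_params = "Saves an image."

print(short_description(with_params))     # Counts the objects in an image.
print(short_description(without_params))  # Saves an image  (last character lost)
```

Docstrings that follow the "Parameters:" convention are unaffected by this edge case.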

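In vision_agent.agent.vision_agent.py, retrieve_tools returns only each function's name, signature, and description. So how does CodeAgent know what the tools actually do and how to call them?

The author uses a neat trick here: the prompt states that the functions are available via from vision_agent.tools import *.

With that in place, the LLM does not need to know how a function is implemented. It only needs to know what the function is for, what arguments it takes, and what it returns, and can then call it in the generated program.
The prompt in the source: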
Documentation:
This is the documentation for the functions you have access to. You may call any of these functions to help you complete the task. They are available through importing from vision_agent.tools import *.
{docstring}

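How is the planning designed at the code level?

The source of class VisionAgent in vision_agent.agent.vision_agent.py shows that
it is implemented through chat_with_workflow().
The workflow calls the agents above in sequence and combines their return values into a single JSON-like dictionary;
__call__ calls chat_with_workflow() and returns the result.
The source follows.

[Figure: flowchart of the workflow]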

## Core processing flow
   def chat_with_workflow(
       self,
       chat: List[Message],
       self_reflection: bool = False,
       display_visualization: bool = False,
   ) -> Dict[str, Any]:
       """Chat with Vision Agent and return intermediate information regarding the task.

       Parameters:
           chat (List[MediaChatItem]): A conversation
               in the format of:
               [{"role": "user", "content": "describe your task here..."}]
               or if it contains media files, it should be in the format of:
               [{"role": "user", "content": "describe your task here...", "media": ["image1.jpg", "image2.jpg"]}]
           self_reflection (bool): Whether to reflect on the task and debug the code.
           display_visualization (bool): If True, it opens a new window locally to
               show the image(s) created by visualization code (if there is any).

       Returns:
           Dict[str, Any]: A dictionary containing the code, test, test result, plan,
               and working memory of the agent.
       """

       if not chat:
           raise ValueError("Chat cannot be empty.")

       # NOTE: each chat should have a dedicated code interpreter instance to avoid concurrency issues
       with CodeInterpreterFactory.new_instance() as code_interpreter:
           chat = copy.deepcopy(chat)
           media_list = []
           for chat_i in chat:
               if "media" in chat_i:
                   for media in chat_i["media"]:
                       media = code_interpreter.upload_file(media)
                       chat_i["content"] += f" Media name {media}"  # type: ignore
                       media_list.append(media)

           int_chat = cast(
               List[Message],
               [{"role": c["role"], "content": c["content"]} for c in chat],
           )

           code = ""
           test = ""
           working_memory: List[Dict[str, str]] = []
           results = {"code": "", "test": "", "plan": []}
           plan = []
           success = False
           retries = 0

           while not success and retries < self.max_retries:
               self.log_progress(
                   {
                       "type": "plans",
                       "status": "started",
                   }
               )
               # Write the plan
               plan_i = write_plan(
                   int_chat,
                   T.TOOL_DESCRIPTIONS,
                   format_memory(working_memory),
                   self.planner,
               )
               plan_i_str = "\n-".join([e["instructions"] for e in plan_i])

               # Log the current progress
               self.log_progress(
                   {
                       "type": "plans",
                       "status": "completed",
                       "payload": plan_i,
                   }
               )

               if self.verbosity >= 1:
                   _LOGGER.info(
                       f"\n{tabulate(tabular_data=plan_i, headers='keys', tablefmt='mixed_grid', maxcolwidths=_MAX_TABULATE_COL_WIDTH)}"
                   )

               # Look up tools based on the plan
               tool_info = retrieve_tools(
                   plan_i,
                   self.tool_recommender,
                   self.log_progress,
                   self.verbosity,
               )

               # Write the code and the tests
               results = write_and_test_code(
                   chat=int_chat,
                   tool_info=tool_info,
                   tool_utils=T.UTILITIES_DOCSTRING,
                   working_memory=working_memory,
                   coder=self.coder,
                   tester=self.tester,
                   debugger=self.debugger,
                   code_interpreter=code_interpreter,
                   log_progress=self.log_progress,
                   verbosity=self.verbosity,
                   media=media_list,
               )

               # Use the test result to decide whether to reflect
               success = cast(bool, results["success"])
               code = cast(str, results["code"])
               test = cast(str, results["test"])
               working_memory.extend(results["working_memory"])  # type: ignore
               plan.append({"code": code, "test": test, "plan": plan_i})

               if not self_reflection:
                   break

               # Reflect
               reflection = reflect(
                   int_chat,
                   FULL_TASK.format(
                       user_request=chat[0]["content"], subtasks=plan_i_str
                   ),
                   code,
                   self.planner,
               )
               # After reflecting, go back to the top of the loop

           execution_result = cast(Execution, results["test_result"])

           if display_visualization:
               for res in execution_result.results:
                   if res.png:
                       b64_to_pil(res.png).show()
                   if res.mp4:
                       play_video(res.mp4)

           return {
               "code": DefaultImports.prepend_imports(code),
               "test": test,
               "test_result": execution_result,
               "plan": plan,
               "working_memory": working_memory,
           }

Author：Yangyehan

Link：http://example.com/2024/07/26/VisionAgent%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/

Publish date：July 26th 2024, 8:22:51 pm

Update date：August 28th 2024, 9:56:22 am

License：This article is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License
