韩宇 committed
Commit a9f3a3a · 1 Parent(s): 1b7e88c

init readme.md

Files changed (1)
  1. README.md +10 -126
README.md CHANGED
@@ -1,127 +1,11 @@
1
- # Video Understanding Example
 
 
 
 
 
 
 
 
 
2
 
3
- This example demonstrates how to use the framework for an hour-long video understanding task. The example code can be found in the `examples/video_understanding` directory.
4
-
5
- ```bash
6
- cd examples/video_understanding
7
- ```
8
-
9
- ## Overview
10
-
11
- This example implements a video understanding task workflow based on the DnC workflow, which consists of the following components:
12
-
13
- 1. **Video Preprocess Task**
14
- Preprocess the video, transcribing its audio via the speech-to-text capability
15
- It detects the scene boundaries, splits the video into several chunks and extracts frames at specified intervals
16
- Each scene chunk is summarized by the MLLM with detailed information, then cached and stored in the vector database for Q&A retrieval
17
- Video metadata and the video file's MD5 hash are passed along for filtering
18
-
19
- 2. **Video QA Task**
20
- Take the user's input question about the video
21
- Retrieve related information from the vector database using the question
22
- Extract the approximate start and end times of the video segment related to the question
23
- Generate a video object from the serialized data in short-term memory (STM)
24
- Build the initial task tree from the question and pass it to the DnC task
25
-
26
- 3. **Divide and Conquer Task**
27
- - Execute the task tree with the question
28
- For detailed information, refer to the [DnC Example](./DnC.md#overview)
29
-
30
- The system uses Redis for state management, Milvus for long-term memory storage and Conductor for workflow orchestration.
31
-
32
- ### The whole workflow looks like the following diagram:
33
-
34
- ![Video Understanding Workflow](docs/images/video_understanding_workflow_diagram.png)
35
-
36
- ## Prerequisites
37
-
38
- - Python 3.10+
39
- - Required packages installed (see requirements.txt)
40
- - Access to OpenAI API or compatible endpoint (see configs/llms/*.yml)
41
- - [Optional] Access to Bing API for WebSearch tool (see configs/tools/*.yml)
42
- - Redis server running locally or remotely
43
- - Conductor server running locally or remotely
44
-
45
- ## Configuration
46
-
47
- The container.yaml file manages dependencies and settings for the different components of the system, including Conductor connections, Redis connections, and other service configurations. To set up your configuration:
48
-
49
- 1. Generate the container.yaml file:
50
- ```bash
51
- python compile_container.py
52
- ```
53
- This will create a container.yaml file with default settings under `examples/video_understanding`.
54
-
55
-
56
- 2. Configure your LLM and tool settings in `configs/llms/*.yml` and `configs/tools/*.yml`:
57
- Set your OpenAI API key or compatible endpoint through environment variables or by directly modifying the yml file
58
- ```bash
59
- export custom_openai_key="your_openai_api_key"
60
- export custom_openai_endpoint="your_openai_endpoint"
61
- ```
62
- [Optional] Set your Bing API key or compatible endpoint through environment variables or by directly modifying the yml file
63
- ```bash
64
- export bing_api_key="your_bing_api_key"
65
- ```
66
- **Note: It isn't mandatory to set the Bing API key, as the WebSearch tool will fall back to DuckDuckGo search. However, setting it is recommended for better search quality.**
67
- The default text encoder configuration uses OpenAI `text-embedding-3-large` with **3072** dimensions; make sure the `dim` value of `MilvusLTM` in `container.yaml` matches
68
- Configure other model settings like temperature as needed through environment variables or by directly modifying the yml file
69
-
70
- 3. Update settings in the generated `container.yaml`:
71
- - Modify Redis connection settings:
72
- - Set the host, port and credentials for your Redis instance
73
- - Configure both `redis_stream_client` and `redis_stm_client` sections
74
- - Update the Conductor server URL under conductor_config section
75
- - Configure MilvusLTM in `components` section:
76
- - Set the `storage_name` and `dim` for MilvusLTM
77
- Set `dim` to **3072** if you use the default OpenAI encoder; if you use a different text encoder model or endpoint, adjust the dimension accordingly (see the sketch after this list)
78
- - Adjust other settings as needed
79
- - Configure hyper-parameters for video preprocess task in `examples/video_understanding/configs/workers/video_preprocessor.yml`
80
- - `use_cache`: Whether to use cache for the video preprocess task
81
- `scene_detect_threshold`: The threshold for scene detection, used to determine whether a scene change occurs in the video; a smaller value means more scenes will be detected, default value is **27**
82
- - `frame_extraction_interval`: The interval between frames to extract from the video, default value is **5**
83
- `kernel_size`: The size of the kernel for scene detection, which should be an **odd** number; the default value is automatically calculated based on the resolution of the video. For hour-long videos, it is recommended to leave it blank, but for short videos a smaller value such as **3** or **5** makes it more sensitive to scene changes
84
- - `stt.endpoint`: The endpoint for the speech-to-text service, default uses OpenAI ASR service
85
- - `stt.api_key`: The API key for the speech-to-text service, default uses OpenAI API key
86
- - Adjust any other component settings as needed
87
-
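To make the settings above more concrete, here is a rough sketch of how the relevant sections might look once filled in. The key names and nesting are taken from the bullet points above where possible; field names not mentioned there (such as `base_url`, `password`, and the nested layout of the `stt.*` keys) are assumptions, and all values are placeholders. Follow the structure of your generated `container.yaml` and the shipped `configs/workers/video_preprocessor.yml` rather than this sketch.

```yaml
# Illustrative excerpt of container.yaml; the file generated by
# compile_container.py is authoritative, this only shows the sections to edit.
conductor_config:
  base_url: "http://localhost:8080"   # Conductor server URL (key name may differ)

redis_stream_client:
  host: "localhost"                    # Redis host
  port: 6379                           # Redis port
  password: ""                         # credentials, if your Redis requires them

redis_stm_client:
  host: "localhost"
  port: 6379
  password: ""

components:
  MilvusLTM:
    storage_name: "video_ltm"          # any collection name you like
    dim: 3072                          # must match the text encoder's output dimension
```

The preprocessing hyper-parameters described above would then look roughly like:

```yaml
# Illustrative excerpt of configs/workers/video_preprocessor.yml; check the
# shipped file for the exact key layout.
use_cache: true                  # reuse cached preprocessing results
scene_detect_threshold: 27       # smaller value means more scenes detected
frame_extraction_interval: 5     # interval between extracted frames
# kernel_size: 5                 # odd number; leave unset for hour-long videos, use 3 or 5 for short ones
stt:
  endpoint: "https://api.openai.com/v1"   # speech-to-text service endpoint
  api_key: "your_openai_api_key"          # defaults to the OpenAI API key
```

A `dim` value that does not match your text encoder will not raise an error but silently loses information, so it is worth double-checking here (see Troubleshooting below).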
88
- ## Running the Example
89
-
90
- 1. Run the video understanding example via the webpage:
91
-
92
- ```bash
93
- python run_webpage.py
94
- ```
95
-
96
- First, select a video or upload a video file on the left; after the video preprocessing is completed, ask questions about the video content on the right.
97
-
98
-
99
- 2. Run the video understanding example via the CLI:
100
-
101
- ```bash
102
- python run_cli.py
103
- ```
104
-
105
- The first time, you need to input the video file path; it will take a while to preprocess the video and store the information in the vector database.
106
- After the video is preprocessed, you can input your question about the video and the system will answer it. Note that the agent may give a wrong or vague answer, especially for questions related to the names of the characters in the video.
107
-
108
- ## Troubleshooting
109
-
110
- If you encounter issues:
111
- - Verify Redis is running and accessible
112
- Try a larger `scene_detect_threshold` and `frame_extraction_interval` if you find too many scenes are detected
113
- - Check your OpenAI API key is valid
114
- - Check your Bing API key is valid if search results are not as expected
115
- Check that the `dim` value of `MilvusLTM` in `container.yaml` is set correctly; currently a mismatched dimension setting will not raise an error but will silently lose part of the information (we will add more checks in the future)
116
- - Ensure all dependencies are installed correctly
117
- - Review logs for any error messages
118
- **Open an issue on GitHub if you can't find a solution; we will do our best to help you out!**
119
-
120
-
 
1
+ ---
2
+ title: OmAgent
3
+ emoji: 💬
4
+ colorFrom: yellow
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: 5.0.1
8
+ app_file: app.py
9
+ pinned: false
10
+ ---
11