Introduction
It is hard to imagine our life today without presentations and demos over videoconferences. With this in mind, think about the case when you are walking down the street or riding a train to another city, and you have to attend a video conference. Suddenly your internet bandwidth drops: so low that you cannot view the presentation or hear the voices without corruption. A possible solution is to switch to a so-called "text-only" mode.
There is no such mode yet, but let's outline a concept of an alternative GUI for, say, Google Meet. In this mode, you can read the presentation slides and the transcribed voices in the correct order.
The main use case is reviewing all the information with close-to-real-time delay, because text is not that heavy. In addition, you can review details you missed because of the bad internet, in a chat-like manner. And, hopefully, after a certain time, you get a good connection back; from there on, you continue viewing the presentation by switching back to the normal mode.
Another non-obvious advantage is a clear and precise text summary of the meeting, useful even for the participants who never ran into network issues.
In this article, we will focus on the process of correctly transcribing presentations using OpenAI API models.
The idea behind the scene
Transcription looks like an easy task given the power of existing OCR techniques. That is mostly the case, but problems arise when the text blocks on the image are rearranged. Often it is almost impossible to order the text correctly without looking at the content from a human perspective. This is where multimodal models come in.
We will build on the advances in large multimodal models and test the performance of OpenAI's GPT-4 Turbo on the given task.
Pipeline
Dataset
There are two main stages:
- prepare the samples for processing;
- process the samples using the OpenAI API.
Images preparation
First, the slide images may be obtained by downloading the presentation as separate images from the Canva environment. In the next step, we need to encode our local pictures to send them to the OpenAI API.
```python
import base64

# Function to encode a local image as base64 for the OpenAI API
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
```
OpenAI API
After the input data preparation, we use the OpenAI API ChatCompletion request. In addition, we set "temperature" to 0 and "detail" to "high" (the latter matters most for slides with small-font text blocks).
```python
import requests
import openai

# model, system_prompt, and base64_image are defined earlier in the pipeline

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {openai.api_key}"
}

payload = {
    "model": model,
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    "max_tokens": 4096,
    "temperature": 0,
    "seed": 0
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
response_content = response.json()['choices'][0]['message']['content']
print(response_content)
```
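The last two lines of the snippet above assume the request succeeded. A hedged helper (our own addition, not part of the OpenAI client) surfaces API errors instead of failing with an opaque `KeyError`:

```python
def extract_content(response_json):
    """Return the transcription text from a chat completions response,
    or raise with the API's own error message."""
    if "error" in response_json:
        raise RuntimeError(f"OpenAI API error: {response_json['error'].get('message', 'unknown error')}")
    return response_json["choices"][0]["message"]["content"]
```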
Previously, GPT-4V was used for this purpose, but at the time this article was being prepared, a better option appeared: GPT-4 Turbo can be used for vision tasks too.
The following prompt was used to get the ordered text from the slides:

```
You are a helpful assistant. Separate the slide title if it exists. Group all the text blocks from the presentation slide into the subsequent text blocks. The order of the blocks should be preserved like a human have read this presentation.
```
Results
```
### Slide Title:
GOALS AND OBJECTIVES

### Text Blocks:
1. **Innovation Leadership**
   - Fostering a culture of creativity and exploration, encouraging research and development initiatives, and consistently introducing groundbreaking robotics solutions to the market.
2. **Customer Satisfaction and Engagement**
   - Actively seeking and incorporating customer feedback into product development, providing exceptional customer support, and fostering long-term relationships with clients.
3. **Talent Development and Retention**
   - Offering competitive salaries and benefits, providing opportunities for professional growth and advancement.
```
On slide 1, the model correctly detected the text and combined it into consecutive blocks.
```
The image displays a timeline with key milestones for a project spread across four months:

- **January**: Algorithm Development
- **February**: Hardware Integration
- **March**: Field Testing and Optimization
- **April**: Regulatory Compliance and Certification

Each month is represented by a numbered location marker, indicating the sequence of the project's phases.
```
Slide 2 lacked a title, so the model provided its own interpretation of one (which makes sense if we want to follow the slide structure). Overall performance is great, and the model got this slide right.
```
### Slide Title: VALUES

#### Mission
- **Innovation Excellence:** Pursue excellence in innovation by fostering a culture of creativity, curiosity, and continuous learning to stay at the forefront of technological advancement.
- **Integrity and Accountability:** Uphold the highest standards of integrity, transparency, and accountability in all our interactions, ensuring trust and reliability in our relationships with clients, partners, and stakeholders.

#### Values
- **Empower Industries:** Harness the potential of robotics to revolutionize industries, enhance productivity, and drive innovation in diverse sectors worldwide.
- **Foster Sustainability:** Develop robotics solutions that contribute to sustainable practices, minimize environmental impact, and promote responsible use of technology for the betterment of society.

#### Vision
```
The model correctly detected the text blocks and even some text from the pictures, but it got confused with the subheadings (replacing Values with Vision and swapping the Mission and Vision text blocks). This is the place for additional prompt engineering and extra algorithms on top.
```
### Slide Title:
DEVELOP A SOLID BUSINESS PLAN

### Text Blocks:
**PROJECT 1**
- Create a comprehensive business plan that outlines your company's vision, mission, goals, target market, competitive landscape, and marketing strategy.
**PROJECT 2**
- Define a clear value proposition that communicates the unique benefits of your product or service to customers and sets you apart from competitors.
**PROJECT 3**
- Develop a scalable business model that allows for growth and expansion over time.
```
GPT-4 Turbo correctly understood the structure and the sequence of the projects on the given slide.
```
### Slide Title:
STRATEGIES

### Text Blocks:
1. **Strategy N°1**
   - Identify a Niche and Validate the Market
2. **Strategy N°2**
   - Develop a Solid Business Plan and Strategy
3. **Strategy N°3**
   - Build a Strong Team and Network
```
The model handled the transcription of the provided slide without any problems.
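Since the model answers in a fairly stable markdown shape (`### Slide Title:` followed by `### Text Blocks:`), the transcripts can be parsed into structured data for the text-only GUI. A sketch, assuming the format stays as in the examples above (`parse_transcript` is our own illustrative helper):

```python
def parse_transcript(markdown_text):
    """Split a model response into (title, list of text block lines)."""
    title = None
    blocks = []
    section = None
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("### Slide Title:"):
            section = "title"
            # The title sometimes follows the colon on the same line
            rest = stripped[len("### Slide Title:"):].strip()
            if rest:
                title = rest
        elif stripped.startswith("### Text Blocks:"):
            section = "blocks"
        elif stripped:
            if section == "title" and title is None:
                title = stripped
            elif section == "blocks":
                blocks.append(stripped)
    return title, blocks
```

In the text-only mode, such a structure could then be rendered as a chat-like feed, one block per message.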
Conclusion
GPT-4 Turbo handles the slide transcription process with high quality. The process is not yet perfect, but it is a place to start improving from: there is plenty of work left to reach production-ready performance and reliability. All the code may be found here. Thanks for reading till the end of the article. Don't hesitate to share your opinion or ideas on what can be done differently.