General information
Video content analysis services are already available on the market (e.g., Google Cloud Platform – Video Intelligence, Amazon Rekognition or Microsoft Video Indexer). These tools offer advanced video analysis, but they have limitations: limited flexibility, and limited options for on-premises data processing and integration with other solutions. At the same time, the development of artificial intelligence and Large Language Models has opened up many new possibilities in video content analysis and indexing.
Therefore, we wanted to design and build a solution made of customizable, flexible modules that responds to the needs reported by our customers. We developed two new components of the Matena package: Matena Analyzer (for analyzing and indexing video content) and Matena Knowledge (for storing and presenting the results of processing). Matena Analyzer can be used for face recognition, location and object detection, transcription, diarization, tagging, flagging adult content, OCR and more. To build this system, we used the latest artificial intelligence solutions. What distinguishes our system is that it is flexible, open to solutions from various providers, and prepared for new solutions that may appear in the near future.
The target groups for our solution are media groups and companies associated with the media industry, as well as individuals and companies that manage multimedia libraries in general. With the Matena products, our clients can search their multimedia archives and individual files more efficiently, based on, among other things, people, locations, objects, words or tags. Additionally, they can monetize their indexed data.
System overview
Modularity and the possibility of installing Matena on premises
The main architectural assumption of our system is modularity. Thanks to it, we can add, activate, and deactivate individual components depending on what the user needs at a given moment. The system modules can be combined into pipelines depending on the needs of the clients. The following modules are currently available in Matena:
- Matena Analyzer:
  - Face recognition module (with database of faces)
  - Transcription and diarization module
  - Module describing the material with a Large Language Model
  - Integration with other video analysis tools
- Matena Knowledge – central data repository of analyzed materials
An important aspect is that the full version of Matena can be installed on premises. Additionally, the quality of the analysis it provides is comparable to the quality of the results generated by GPT.
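The idea of combining modules into a pipeline and switching them on or off can be illustrated with a minimal sketch. This is not Matena's actual code: the `Pipeline` class, the stub modules `recognize_faces` and `transcribe`, and the shape of the result dictionary are all hypothetical and serve only to show the pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical module contract: take the result dict, return it enriched.
Module = Callable[[dict], dict]


@dataclass
class Pipeline:
    modules: Dict[str, Module] = field(default_factory=dict)
    enabled: Dict[str, bool] = field(default_factory=dict)

    def add(self, name: str, module: Module, enabled: bool = True) -> None:
        """Register a module; modules run in the order they were added."""
        self.modules[name] = module
        self.enabled[name] = enabled

    def set_enabled(self, name: str, enabled: bool) -> None:
        """Activate or deactivate a registered module."""
        if name not in self.modules:
            raise KeyError(f"unknown module: {name}")
        self.enabled[name] = enabled

    def run(self, material: dict) -> dict:
        """Pass the material through every enabled module in turn."""
        result = dict(material)
        for name, module in self.modules.items():
            if self.enabled[name]:
                result = module(result)
        return result


# Stub modules standing in for real analysis components (assumptions).
def recognize_faces(result: dict) -> dict:
    return {**result, "faces": []}


def transcribe(result: dict) -> dict:
    return {**result, "transcript": ""}
```

A client that only needs face recognition would register both modules and call `set_enabled("transcription", False)`; the rest of the pipeline is unaffected.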
Face recognition
One of the main modules available in Matena Analyzer is face recognition in video material. For this purpose, we implemented a face recognition algorithm and applied affine transformations to maximize the model’s effectiveness. Additionally, we created an algorithm that matches recognized faces and objects across consecutive frames of a video sequence. We designed our system to search efficiently through a large set of faces. The face recognition module also supports training with new faces added by users, who can create their own databases.
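The two ideas above, matching detections between consecutive frames and looking a face up in a database, can be sketched as follows. The source does not describe which techniques Matena uses, so this sketch assumes two common stand-ins: greedy matching by bounding-box overlap (IoU) and a cosine-similarity lookup over face embeddings. All function names and thresholds are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(prev: List[Box], curr: List[Box],
                min_iou: float = 0.5) -> Dict[int, int]:
    """Greedily map each current-frame box index to a previous-frame index."""
    pairs = sorted(
        ((iou(p, c), pi, ci)
         for pi, p in enumerate(prev)
         for ci, c in enumerate(curr)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_prev = set()
    for score, pi, ci in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if pi in used_prev or ci in matches:
            continue
        matches[ci] = pi
        used_prev.add(pi)
    return matches


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def identify(embedding: Sequence[float],
             database: Dict[str, Sequence[float]],
             min_similarity: float = 0.6) -> Optional[str]:
    """Return the most similar known person, or None if nobody is close enough."""
    best_name, best_score = None, min_similarity
    for name, reference in database.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

The linear scan in `identify` is only for illustration. With a database of thousands of faces, a vector index or an approximate nearest-neighbor search would likely be a better fit.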
Because international face databases do not address the needs of companies in our country, we had to build the face database for the face recognition module from scratch. It includes about 5,000 Polish public figures. Building it was a long process that involved manual and automatic downloading of publicly available photos of well-known people from the internet. Face recognition was tested multiple times, and the database itself was frequently cleaned and adjusted so that the face recognition module works as effectively as possible.
Transcription and diarization
One of the factors influencing the accuracy of a video material description is what the people appearing in it are saying. Therefore, our system analyzes both the image and the text of the material. We used speech-to-text technology to get transcriptions of individual frames and of the entire material. For this purpose, we use Microsoft Speech and OpenAI’s Whisper. We also apply diarization, that is, dividing the transcription by individual speaker.
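One common way to combine a transcription with diarization output is to give each transcribed segment the speaker whose turn overlaps it the most in time. The sketch below shows that idea only. It assumes the speech-to-text step yields timed text segments and the diarization step yields timed speaker turns; the `Segment` and `SpeakerTurn` types are hypothetical and are not the output formats of Microsoft Speech or Whisper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str


@dataclass
class SpeakerTurn:
    start: float  # seconds
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment],
                    turns: List[SpeakerTurn]) -> List[Tuple[str, str]]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled
```

Segments that overlap no speaker turn are labeled `"unknown"` rather than guessed, so gaps in the diarization stay visible in the result.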
Describing the material with a language model
In addition to the face recognition and transcription modules, an important element of our system is a component that extracts other details from the video and audio material. These include sentiment, tags and labels describing the frame, objects and places. This functionality is built on the GPT-4 model. The system can also be integrated with other models (e.g. Bielik, an open-source Polish language model), depending on the needs of a given user. We applied prompt engineering to receive concise, structured responses from the LLM. Prompts can be adjusted to the users’ needs if they want to process their materials for a specific purpose.
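Asking the model for a fixed structure and validating its reply can be sketched as below. The prompt wording, the field list and both function names are assumptions made for illustration; they are not Matena's actual prompts. The sketch deliberately leaves out the model call itself, so it works the same whichever LLM is plugged in.

```python
import json
from typing import List, Sequence

# Hypothetical set of fields requested from the model.
FIELDS = ("sentiment", "tags", "objects", "places")


def build_prompt(transcript: str, people: List[str],
                 fields: Sequence[str] = FIELDS) -> str:
    """Compose a prompt that asks for a single JSON object with fixed keys."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        "You describe a fragment of video material.\n"
        f"Reply with one JSON object containing exactly these keys: {keys}.\n"
        "Use a short string for sentiment and lists of short strings "
        "for the other keys.\n"
        "Do not add any text outside the JSON object.\n\n"
        f"Recognized people: {', '.join(people) if people else 'none'}\n"
        f"Transcript: {transcript}\n"
    )


def parse_response(raw: str, fields: Sequence[str] = FIELDS) -> dict:
    """Parse the model reply and keep only the requested keys."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return {name: data[name] for name in fields}
```

Adjusting the analysis for a specific purpose then comes down to passing a different `fields` sequence or editing the instruction text, while the validation step stays the same.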
Integration with other video analysis services
Matena can be integrated with other video analysis and indexing services, for example with Microsoft Video Indexer. Such integration can be implemented if the user wants to use specific software. Data provided by other services can be cleaned and used to enrich the Matena Knowledge database.
The potential of Matena Knowledge – central video data repository
We created a component for storing and presenting the results of processing called Matena Knowledge. It is a central repository of data from analyzed materials. Each video material is divided into keyframes, which are analyzed separately, and then an LLM creates a general summary from all frames of the given material. Users can therefore browse details about each frame as well as the entire material. The module is designed so that users can browse analyzed materials efficiently.
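The per-frame and per-material structure described above can be pictured with a small data model. This is a sketch under assumptions: the `FrameAnalysis` and `MaterialRecord` types, their fields, and the `summarize` callable (a placeholder for the LLM summarization step) are hypothetical and do not reflect the actual Matena Knowledge schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FrameAnalysis:
    timestamp: float  # position of the keyframe in seconds
    people: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    description: str = ""


@dataclass
class MaterialRecord:
    material_id: str
    frames: List[FrameAnalysis]
    summary: str = ""


def build_record(material_id: str,
                 frames: List[FrameAnalysis],
                 summarize: Callable[[List[str]], str]) -> MaterialRecord:
    """Order the keyframes by time and derive one summary from all of them."""
    ordered = sorted(frames, key=lambda frame: frame.timestamp)
    summary = summarize([frame.description for frame in ordered])
    return MaterialRecord(material_id, ordered, summary)


def find_frames(record: MaterialRecord,
                person: Optional[str] = None,
                tag: Optional[str] = None) -> List[FrameAnalysis]:
    """Return the keyframes that match every filter that was given."""
    return [
        frame for frame in record.frames
        if (person is None or person in frame.people)
        and (tag is None or tag in frame.tags)
    ]
```

Keeping the frame-level results next to the material-level summary lets a user start from the summary and drill down to the exact keyframes where a given person or tag appears.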
As such a repository, Matena Knowledge can be especially useful for media companies – they can use it for cognitive search, RAG systems or other business applications.
Outcome
We created a system that enables the analysis and indexing of video content. A significant advantage of our solution is that its components are flexible and can be customized to the individual needs of each client, and we are not limited to using services only from one vendor.
We are continuously developing the Matena tools and plan to introduce additional modules, such as sound effect or music recognition. Feedback and suggestions from our users are of great importance to us, and we always do our best to tailor our solutions to them.
Industry
Media industry
Keywords
Multimedia libraries management, face recognition, speech to text, STT, large language model, LLM
Technologies
Microsoft Speech, Whisper, GPT-4, Bielik, Llama