Senior Data Scientist @ RTL Netherlands
A common challenge for teams working on video machine learning applications is how to scale and automate their ML lifecycle when working with these types of large unstructured datasets.
In this webinar, Vincent Koops, Senior Data Scientist at RTL Netherlands, will walk through their Video AI platform at RTL and how they’ve addressed these challenges.
Their platform is built on top of Pachyderm and Kubernetes to enable a wide range of ML applications such as automatic thumbnail picking and mid-roll marking.
Originally broadcast on Data Science Central
Sean: Good morning, good afternoon, and good evening to all of our attendees joining us today for this latest Data Science Central webinar. This is Sean Welch, your host. I'm the host and producer at Data Science Central. I'd like to start our event off today by thanking Pachyderm for sponsoring today's event. Pachyderm is a longtime supporter of the Data Science Central community, and we're honored to have them sponsoring our event today.
Today's webinar is entitled, "AI vs Unstructured Data: Best Practices for Scaling Video AI," to be presented by Pachyderm. And before we begin, I'd like to briefly review the format of today's webinar. Today's event will be one hour long. We have one presenter who I'll introduce in just a minute, and there will be a 10-to-15-minute Q&A following the presentation. And this event is being recorded and will be available on DataScienceCentral.com later this afternoon, following today's live event. I'd also like to encourage our attendees to provide questions throughout the presentation. We will be reviewing and presenting them on your behalf during the Q&A portion of today's event.
Well, I'm very pleased to introduce today's speaker, Vincent Koops with RTL Netherlands. Vincent is an AI researcher and composer holding degrees in sound design and music composition from the HKU University of the Arts Utrecht and degrees in artificial intelligence from Utrecht University. After a research internship at Carnegie Mellon University, he completed his PhD in music information retrieval at Utrecht University. Currently, he is a senior data scientist at RTL Netherlands, working on AI multimedia projects. He is responsible for developing a scalable video intelligence platform to make video content more discoverable, searchable, and valuable. He develops AI solutions to automatically analyze music and video content, automatically generate movie trailers for different devices, and automatically pick thumbnails for video content for the VOD platform video layout. Vincent is also a co-organizer of the International AI Song Contest, in which teams compete to write a song with AI.
Thanks for being with us today, Vincent. We're looking forward to your presentation. A common challenge for teams working on video machine learning applications is how to scale and automate their ML lifecycle when working with these types of large, unstructured datasets. In today's Data Science Central webinar, Vincent Koops, senior data scientist at RTL Netherlands, will walk through their video AI platform at RTL and how they've addressed these challenges. Their platform is built on top of Pachyderm and Kubernetes to enable a wide range of ML applications such as automatic thumbnail picking and mid-roll marking. Today, you will learn how to take a modular approach to creating a scalable and automated ML platform, the challenges and best practices when working with unstructured data like video clips, and considerations your teams need to make to prevent human error while getting the most out of AI and ML. Vincent, with that, I'm going to turn it over to you. You can begin as soon as you're ready to go.
Vincent: Thank you very much, Sean, for the kind introduction. Yeah, so today I wanted to talk about video AI and our approach to it at RTL Netherlands. But before I do that, I wanted to give a quick introduction to who we are and where I work. So I work at RTL Netherlands. And RTL Netherlands is part of Bertelsmann. This is the logo on top on the left side. So Bertelsmann is an international media company. Penguin Random House and BMG, the record company, are part of it, for example. And part of Bertelsmann is RTL Group, and RTL Netherlands is part of RTL Group. And RTL Netherlands, we are the largest commercial broadcaster in the Netherlands, and we have linear TV channels, of which we have about 28% audience share. We have AVOD and SVOD channels, of which RTL XL and Videoland are the largest. So Videoland is basically our Netflix competitor in the Netherlands. And we have other brands like weather channels like [inaudible], and a very large news organization as well. So with our linear television, we reach about nine million people a day, and with our online video, about 780 online views per month, and some 2.3 million unique visitors across our digital publishing efforts.
So I'm part of the data science team. So we have about six data scientists in our team. And you can imagine with all these channels and all these digital channels, we gather a ton of interesting data that needs to be analyzed. So we do a wide range of things like forecasting ratings on television, but also on video-on-demand. We build our recommendation systems, we do research into robot journalism, and we also publish robot journalism articles. And so we also do video AI. And in this call, we also have Daan Odijk, who is the lead data scientist of our team who is also available at the end of the call to take some questions.
All right. So there are basically three things that I wanted to discuss in this talk. So first is video AI at RTL. What is our vision, what do we mean by that, and what are we actually trying to solve? And the second thing is how we put it into practice with Pachyderm. And then I wanted to zoom in on two use cases that I think nicely show what we're capable of with our solution. So the first one is AI thumbnail selection, so picking the right images or thumbnails on our video-on-demand platform. And the second is deciding where to place an ad in our video-on-demand platform content.
All right. Our vision at RTL Netherlands when it comes to artificial intelligence is basically that we want to use it to make sure we optimally use our human intelligence. So we want to put it into practice in a way that facilitates what humans are good at: being creative, associating freely, and connecting to the hearts and minds of our fans. So we want to use artificial intelligence to boost that, to help our creative people be better at what they're doing, more creative. Because most of the things that we deal with in our video AI applications have to do with content operations. And content operations is hard, first of all, because it's labor-intensive work. It's just hard work to make videos, to edit videos. But it also takes a lot of human creativity, which is, of course, really hard to replace with AI, but which we can support with AI, which we can bring to a next level with AI.
So some of the things that we are doing at RTL Netherlands include, for example, automatic promo and trailer generation. So that means: from a movie, can we automatically select the right scenes, put them in the right order, and create a promo or trailer for TV or for video-on-demand? Another is: can we help our video editors by automatically spotting which part of the content might be interesting for them? So basically, knowing what happens where in a video and providing that information to the editors so they can quickly find content in the video, or across videos, that helps in their editing process. And then there's intelligent cropping of images and videos. So how can we, if we have some video content, automatically make it the right aspect ratio for Instagram, for example, where you want a square aspect ratio, compared to Facebook, where you want a five-to-four aspect ratio, for example? Just some examples of what we try to tackle with video AI.
Now, these things are highly complicated, of course. If you wanted to build one model that solves each of these problems, you'd get a really large model that's really hard to manage. So creating one single end-to-end model that solves these problems is not a very good idea. Instead, we take a modular approach to machine intelligence. We take these complex computational tasks and divide them into simpler subtasks. So instead of creating task-specific models, such as a model that takes a video as input and outputs a trailer, we break these tasks down into simple subtasks that are basically solved, or that can be approached very well by artificial intelligence, and reuse those elements in an intelligent way so that we can solve the complex tasks. And this approach, in the package that we created, we call VideoPipe. So this is basically our artificial intelligence production pipeline, and we can use it to create new RTL content. So these are trailers and promos, for example. But we can also make existing content more discoverable and valuable, because we learn more from the content, and we can use all those insights in other aspects of where the content is distributed and used.
All right. So we can extract a lot of information from a video. A video, of course, consists of separate streams of information: there's a visual domain, there's an audio domain, and there's a text domain. And we created models for each of these aspects that basically try to solve a relatively simple task. Some of those are still quite hard problems, but simple enough that we can use them with care, that we know what the output is, and that we can recombine them in an interesting way. So in the visual domain, for example, we can do keyframe extraction, so we know, for example, what a good frame is from a shot in a movie. We can do shot segmentation, so we can automatically cut up a video so that we basically have all the separate shots that make up the entire video. We can compute optical flow, so we can get a sense of how fast a shot is moving, for example. We have models for aesthetics analysis, so we know how pretty something onscreen is. Visual similarity analysis can tell us where content repeats within content, for example, bumpers or credits that appear across videos. In that sense, we can detect where the credits start or where the bumpers are. And we can do face detection, emotion recognition, object detection, and a lot of other things.
And then in the audio domain, we can do audio tagging, so we know whether there's a dog barking or a car driving by. We can do speaker segmentation and identification so we know who is speaking where and with what emotion. And if there's music in a movie, for example, we can also analyze that to get a sense of the musical genre or the mood that's part of that scene. Usually with video, there is also text data related to that, so subtitles or a script, those kind of things. And we have models to analyze that text data so we can do language detection, we can do sentiment analysis, we can analyze the key phrases, we can automatically summarize scripts, and tons of other things. What I wanted to show here is that from all these different streams that make up a video, we can extract all these really interesting aspects that are basically single modules that solve one particular task.
And now, if you want to solve more complex tasks, of course we want to combine these things in some kind of way. And what we use for that is Pachyderm. So all these models that I just showed, we use Pachyderm to scale these, to orchestrate these, and to solve a more complex task using these modules. Because Pachyderm facilitates creating these flexible solutions from simpler models, especially because it's data driven and code agnostic. With data driven, I mean that it's basically the video data, in our case, that starts a computation and starts computing all these aspects with these smaller modules downstream. And I'll give an example of that further on in the presentation. What's also important for us is reproducibility. So in our case, that means versioning. We want to track, when we extract data or metadata from a video or other kinds of interesting things, where that was computed from. But also, if we have a new video that is maybe a re-edit of a video that was analyzed before, we can put it into VideoPipe again and get updated output. And we can still go back to the old output and see what it was, to learn from it, but also to be able to step back if the previous video turned out to be the better one, for example.
Then scalability is also very important for us, because video data is big data. As you can imagine, high-quality video commonly has 25 frames per second of high-quality images. So scheduling and scaling these pipelines is very important to us. And with parallel processing, we can also analyze tons of videos at the same time. We get a lot of content sent in every day, and we don't want to be restricted to analyzing one video at a time. If there are a lot of videos coming in, we want to analyze them basically all at the same time. And with Pachyderm, we can do that. We also use other products in other parts of our organization; for example, we also have parts of VideoPipe running in Argo Workflows. That also allows us to schedule pipelines, which works really well for us in some situations, and it can account for this modularity aspect, but it cannot really account for the reproducibility aspect. And we have tried some other solutions in the past; for example, we also have solutions with DVC, which is a solution for data versioning. But the benefit of Pachyderm for us is that all of these features are tightly coupled in one product, which works really well for us.
All right. So this was part of the vision and the idea behind VideoPipe and our video AI approach. Now I wanted to show how we actually put it into practice. And I wanted to do this by going through our GitHub repo, basically, and showing you what's in there, because I think it really nicely shows what we have, what we use, and what the elements are that are important for us. So there are two things that are very important in VideoPipe. The first is a data model, which allows us to merge the output of these individual modules and basically facilitates making these more complex models. And then we have our pipelines, which are the individual modules, in essence.
And this data model gives us exactly that. So we can make sure the output of the face detection model is on the frame level. So for each frame, we know which face is in there. And then for the speaker recognition model, we also know per frame, for example, who is speaking. And then, combining the data, we now know who is speaking where, on which frame, or which faces correlate with what speech, which gives us a lot more possibilities than these individual steps. So this is very important for VideoPipe. And basically, what it contains is the name of the pipeline; the video source, so that we can trace back which video this content came from; and then some other metadata, like frames per second and duration. We have a data version: if we have an upgrade of this data model, we can increase this number so that we know that this is a newer version. There's some shape data about the video, and then a dimension across which this data is extracted. So this could be on a seconds level, on a frame level, or a span level. So, for example, it starts at second X and ends at second Z, and the data is defined in that span. And then there's the data itself, of course.
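The shared data model Vincent describes could be sketched roughly like this; the class and field names are illustrative assumptions, not RTL's actual schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PipelineOutput:
    """Illustrative sketch of a shared output schema along the lines
    described in the talk; not RTL's actual specification."""
    pipeline: str        # name of the pipeline that produced this output
    video_source: str    # where the source video came from, for traceability
    fps: float           # frames per second of the analyzed video
    duration: float      # video duration in seconds
    data_version: int    # bumped whenever the data model itself is upgraded
    shape: tuple         # shape metadata about the video (width, height, ...)
    dimension: str       # "frame", "second", or "span"
    data: list[Any] = field(default_factory=list)  # the extracted values

# Example: a face detection result keyed on the frame dimension.
faces = PipelineOutput(
    pipeline="face_detection", video_source="s3://bucket/episode.mp4",
    fps=25.0, duration=1320.0, data_version=1,
    shape=(1920, 1080), dimension="frame",
)
```

Because every module emits the same envelope, outputs keyed on the same dimension (here, frames) can be joined directly, which is what makes the "who is speaking on which frame" combination cheap.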
All right. The second really important part of VideoPipe is the pipelines, of course. These are the building blocks that give us the power to solve these complex problems. And basically, all the pipelines are similar in a way. What we have is roughly three or four things. So what we always have is a Dockerfile. This Dockerfile basically tells us what the base image is, or what the image is that we are using to do our computations. Then we have some code. In this case it's a Python file, a frame extraction Python file. And this code is executed inside the container, or the pod, that is spun up using that Dockerfile. And then there's the pipeline definition according to the Pachyderm specs. And I'll go through each of these in the next slides. There's also a requirements file here, but that's only because, in this case, it's Python code, and we install these requirements in the Dockerfile. But since it's code agnostic, we could just as easily have code in another language here. We also use bash scripts, in which case we don't have this requirements file at all. It can be whatever. As long as it runs in this Docker image and creates the right output, it works.
All right. So the first thing I wanted to show is this Dockerfile. It's really simple. We created a base image that basically contains some code and some data that we just want to have there in all cases, to run tests and also to have the data model in there. So to make sure the data model is available in each of the pods running in VideoPipe, we use this base image. Then we copy the data. In lines two and three we install some packages that are needed. FFmpeg, of course, is important when you want to do video editing. Then we install packages with pip, and that's basically it. Then we have our Dockerfile with which we can build our container. And in this container, we're going to run code, of course. So in this case, I wanted to show you a face detection pipeline. On the left there's a bunch of code. If you want to pause and look through the code, you can do that. But I wanted to take a more high-level approach, so on the right, I made this diagram that basically shows you what's happening in the code. In this case, the preceding step is a pipeline that's extracting frames and putting them in a folder that has the name of the video file. Now this pipeline, the face detection pipeline that you're looking at, runs next. It basically runs for all these folders: it gets all the images in the folder, detects all the faces in all the images, puts that output in the data model, and then saves that data model, plus some logging files and some other things so that we can make sure everything ran correctly, to the output repository. And it runs this every time the preceding pipeline has finished. So we detect basically all faces in all images for all videos that we extracted frames from. And that's basically it. It's a quite simple, compact, small pipeline that solves one particular task.
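The flow in that diagram might look roughly like the sketch below. The detector is injected as a function so the structure stays generic; the folder layout mirrors Pachyderm's /pfs convention, and the file and field names are illustrative assumptions rather than RTL's actual code:

```python
import json
from pathlib import Path

def run_face_detection(frames_root, out_root, detect_faces):
    """For each per-video folder of extracted frames, run a face detector
    on every image and write one JSON record per video, mirroring a
    Pachyderm step that reads /pfs/frames and writes /pfs/out."""
    frames_root, out_root = Path(frames_root), Path(out_root)
    for video_dir in sorted(p for p in frames_root.iterdir() if p.is_dir()):
        results = []
        for frame in sorted(video_dir.glob("*.jpg")):
            # detect_faces returns a list of bounding boxes for one image
            results.append({"frame": frame.name, "faces": detect_faces(frame)})
        record = {"pipeline": "face_detection", "video": video_dir.name,
                  "dimension": "frame", "data": results}
        (out_root / f"{video_dir.name}.json").write_text(json.dumps(record))
```

In the real pipeline, Pachyderm re-runs this whenever the upstream frame-extraction step commits new output, so there is no scheduling code here at all.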
All right. And the last thing that's very important is this pipeline definition. So this is a Pachyderm pipeline definition that makes sure it picks the right Docker image and runs the right code. In this case, what I'm showing you is a frame extraction pipeline. It has a description: it extracts frames, of course. And in the transform part, you can see what it is actually running. It runs this command: it runs Python, this frame extraction Python file, and it takes as input a videos repository, so pfs/videos, and it writes all its output, which is basically the data model and the log files, to pfs/out. It pulls an image, so we have an image here that does the frame extraction; it contains the code mentioned in the line above. There are some secrets in there to make sure that we can pull the image. And then, importantly, on line 11 we define that it takes as input the videos repository. So this just makes sure that we can access the data that we want, in this case, the videos.
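A hedged sketch of what such a Pachyderm pipeline definition looks like, expressed here as a Python dict matching the JSON spec shape; the image name, secret name, paths, and glob pattern are illustrative assumptions:

```python
import json

# Sketch of a Pachyderm pipeline spec along the lines described in the talk.
frame_extraction_spec = {
    "pipeline": {"name": "frame-extraction"},
    "description": "Extracts frames from videos in the videos repo.",
    "transform": {
        # Runs inside the container; reads /pfs/videos, writes /pfs/out.
        "cmd": ["python3", "/code/frame_extraction.py"],
        "image": "registry.example.com/videopipe/frame-extraction:latest",
        "image_pull_secrets": ["registry-secret"],
    },
    # Each top-level path in the "videos" repo is one datum; committing a
    # new video triggers this pipeline and everything downstream of it.
    "input": {"pfs": {"repo": "videos", "glob": "/*"}},
}

print(json.dumps(frame_extraction_spec, indent=2))
```

The `input` block is what gives the data-driven behavior described earlier: pipelines subscribe to repos, so the arrival of data, not a scheduler, starts the computation.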
All right. So this was a general setup and just an example of the parts that we use to create more complex tasks or to solve more complex tasks, sorry. Because what I showed you until now were relatively simple, tiny parts that of course, we want to combine. And I think these two use cases will give a nice overview of what we can solve by combining different elements. So the first one I wanted to go into is AI thumbnail selection, and the second is ad mid-roll marking. AI thumbnail selection is basically this problem, so this is the web interface of our Videoland video-on-demand platform, and we also have "Keeping Up with the Kardashians," as you can see. And this is season 15. But for all these videos in this season, we need a thumbnail to show you what is going on in the video that basically tells you, "This is an interesting episode. You might want to click on this to watch it." All right. So those are these images. And the way this was done in the past at RTL was that a designer would basically scroll through all the possible frames that are in a video, would scroll through it and try to find one that looks nice. So this has a couple of problems, of course, right? There is too much information here. There are a lot of frames that are similar to each other, so there's a lot of information here that's not necessary to have. And also, there's probably a lot of images here that are quite obvious that they're not useful for using as a thumbnail. And we learn from our designers. We asked them what makes a good image on Videoland? What should it have? So they told us it needs one to three faces. It needs a certain color combination. It needs to be in focus, of course. It shouldn't contain any spoilers, those kind of things. And a lot of these things separately are problems that we can solve with AI.
So what we built is basically this combination of pipelines that solves this problem, or helps the designers in solving it. And I'll walk you through it from left to right. All the way on the left, you see the origins of content. This can be Azure Blob Storage, or somewhere on S3, or some other way we can gather videos, but they end up in the videos repository. So this is the first repository where all the videos land. From all these videos, we do frame extraction; depending a little bit on the length of the video, we extract a frame every second or so, or if it's a shorter video, we extract more frames. And from these frames, we do face detection. We have a model for aesthetics analysis, so we know how pretty the image is. We do some technical analysis: for example, is it in focus, is the color balance correct, those kinds of things. And we have a couple of other pipelines that extract information from these frames. And using this data model that I explained before, we can easily combine all that information and make some really interesting choices. Because if we combine them, then we can basically look, in one view, at which image has two faces, has a particular aesthetics ranking, is in focus, and also has a particular actor, for example. All that information we can use, and we can rank the images and basically push what the AI, or this ranking, thinks is the best image. We can push that automatically to Videoland.
So this ranking basically solves two problems. One, we can automatically push the best image to Videoland. So basically, when a video is available, it will run through this pipeline, and a thumbnail is published automatically, which saves a lot of time. But secondly, using this ranking, we can also give the designers a top 5 or a top 10 of images. So if the number one ranked image is perhaps not good enough, or the designer thinks there are better images, they can pick another one from this list, maybe the second or the third one. However, we actually see that 99% of all AI-selected stills are used on Videoland, untouched by the editors. And using this combination of pipelines, we actually created a reduction of over a thousand working hours per year.
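A minimal sketch of how such a ranking could combine per-frame signals. Only the signals themselves (face count, aesthetics, focus) and the designers' one-to-three-faces guideline come from the talk; the weights, scores, and scoring function are illustrative assumptions:

```python
def rank_thumbnails(frames, weights=None):
    """Score candidate frames by combining per-frame signals and return
    them best-first. Weights and features are illustrative, not RTL's."""
    weights = weights or {"aesthetics": 0.5, "sharpness": 0.3, "faces": 0.2}

    def score(f):
        # Designers asked for one to three faces; reward that range.
        face_score = 1.0 if 1 <= f["num_faces"] <= 3 else 0.0
        return (weights["aesthetics"] * f["aesthetics"]
                + weights["sharpness"] * f["sharpness"]
                + weights["faces"] * face_score)

    return sorted(frames, key=score, reverse=True)

candidates = [
    {"frame": "f010.jpg", "num_faces": 0, "aesthetics": 0.9, "sharpness": 0.8},
    {"frame": "f042.jpg", "num_faces": 2, "aesthetics": 0.7, "sharpness": 0.9},
    {"frame": "f077.jpg", "num_faces": 5, "aesthetics": 0.6, "sharpness": 0.5},
]
ranked = rank_thumbnails(candidates)
```

The top-ranked frame can be published automatically, while the head of the list doubles as the designers' top 5 or top 10.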
All right. So the second example that I wanted to show you is ad mid-roll marking. So Videoland has a basic-tier subscription, which basically means you can watch the content, but you also have to watch ads. And we want to monetize the content that we have optimally, of course. So what we basically want is to place ads in the least intrusive parts of videos. This is governed by some business rules: there should be an ad around every 20 minutes or so, but there should be no ads during speech or conversation, and it is preferred to place the ads on scene boundaries. And these things we can solve with VideoPipe. Again, this is the overview of the selection of pipelines that we use for solving this problem. And again, on the left, you see the origin of the videos, so Azure, S3, or somewhere else. We put them in the videos repository, and we have a couple of models running here that extract information from these videos that we can use to solve this problem. So we do shot detection, for example. We know where one shot transitions into another, and we can cut the video at those parts. We know where those cuts are. And we do that by looking at color histograms, for example. So when a color histogram changes between subsequent frames, we know there's a high probability that there's a shot boundary.
We can do speech gap detection, because, as I mentioned, it's very important to put ads where there's no speech. It's very annoying to watch content and then have an ad put right in the middle of somebody speaking, right in the middle of a sentence. Then we have a couple of business rules. This depends a little bit on the content: for some content, as I mentioned, there should be an ad every 20 minutes; for shorter content, maybe you only want one ad. And it also depends on the producer; they can put restrictions on the amount or the frequency of ads. So we can use these business rules to serve the correct amount of ads. And then we have a couple of other pipelines that help in this process as well.
But again, using the data model, we can combine all the information from all these pipelines to make the choice that makes this a successful application. So in this case, we can look at where the shot boundaries intersect with the gaps in the speech, so we know where to put these ads. To make this a little more visual: as I said, shot boundary detection basically looks at color histograms. So in this image, if you look from top to bottom, this is the time axis, and I placed a horizontal line where the shot boundary is. In the top left, you see some purple colors, then it switches to more beige colors. And when there's a big switch in colors like this, we know that there's a shot boundary. The same in the middle: in the third image, you see a completely different color distribution compared to the second one, so we also know that there is a shot boundary there. Same in the last row. And we can actually improve this by also looking at faces, for example. And at a higher level, in a similar way, combined with other information, we can also detect the scene boundaries. A scene is comprised of different shots, of course.
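The histogram-based shot boundary detection described above can be sketched as follows; the two-bin "histograms" and the threshold value are toy assumptions to keep the example self-contained:

```python
def histogram_distance(h1, h2):
    """L1 distance between two normalized color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_shot_boundaries(histograms, threshold=0.5):
    """Mark a shot boundary wherever the color histogram changes sharply
    between subsequent frames; returns the index of the first frame of
    each new shot. The threshold is an illustrative assumption."""
    return [i for i in range(1, len(histograms))
            if histogram_distance(histograms[i - 1], histograms[i]) > threshold]

# Toy 2-bin histograms: frames 0-2 are purple-ish, frames 3-4 beige-ish.
hists = [[0.9, 0.1], [0.85, 0.15], [0.9, 0.1], [0.2, 0.8], [0.25, 0.75]]
boundaries = detect_shot_boundaries(hists)  # frame 3 starts a new shot
```

A real implementation would compute per-frame histograms from decoded video frames (and, as Vincent notes, could be refined with face information), but the decision rule is the same abrupt-change test.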
All right. And then, if we combine all this information, we can actually determine where to place an ad. So what I've tried to visualize here is the speech. What we do in the speech gap detection is that from the video, we extract the audio. And from the audio channel, we can split the audio into the speech part and the rest: the music, background sounds, dogs barking, cars driving by, these kinds of things. So basically speech versus the rest. And from this speech channel, we can analyze the intensity. So what you see here, the blue line, basically shows you the intensity of the speech: how loud the speech is at a particular moment in time. And we can set a threshold and say: above this threshold, we think there is a significant amount of speech, and below it, we don't consider it to be speech. And if you threshold it, you see this orange line.
We can turn it into a binary signal. So you see that the high blue lines correlate with the binary values of one in this signal. So everywhere the orange line is one, there is speech. And this gives us the opportunity to look at where the large gaps are. You see those gaps annotated here. We can say, "Okay, everything that's longer than half a second, for example, or longer than a second, we consider to be a gap." And this way, we can identify where the speech gaps are. And then, using that data model, we can very easily combine and create one view of the data that's the output of shot boundaries, speech gaps, and also the business rules. I visualized four examples here, of which only the second one is correct. In the first one, there is a shot boundary, a correct shot boundary, but there is speech in there: you see the orange line has a high value; it's one. At the second vertical line, you see that there's a big gap there, and there's also a shot boundary, so this is a good place, or a candidate, to place an ad. And then you have the other two examples: there's a speech gap but no shot boundary, and the last example is the same. And combining this information makes it possible to solve these more complex tasks from relatively simple models.
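The thresholding, gap-finding, and intersection steps above could be sketched like this; the threshold, minimum gap length, and the toy signals are illustrative assumptions:

```python
def speech_gaps(intensity, threshold, min_gap):
    """Threshold a per-second speech-intensity signal into a binary
    speech/no-speech sequence, then return (start, end) spans of
    non-speech that are at least min_gap seconds long."""
    is_speech = [x >= threshold for x in intensity]
    gaps, start = [], None
    for t, speaking in enumerate(is_speech + [True]):  # sentinel closes a trailing gap
        if not speaking and start is None:
            start = t
        elif speaking and start is not None:
            if t - start >= min_gap:
                gaps.append((start, t))
            start = None
    return gaps

def ad_candidates(shot_boundaries, gaps):
    """An ad candidate is a shot boundary that falls inside a speech gap;
    business rules (e.g. roughly one ad per 20 minutes) would filter further."""
    return [b for b in shot_boundaries
            if any(start <= b < end for start, end in gaps)]

intensity = [0.8, 0.9, 0.1, 0.05, 0.1, 0.7, 0.0, 0.9]
gaps = speech_gaps(intensity, threshold=0.3, min_gap=2)
slots = ad_candidates([2, 6], gaps)
```

Here the shot boundary at second 2 lands inside the long speech gap and becomes a candidate, while the one at second 6 falls in a gap that is too short and is rejected, matching the four-example figure Vincent describes.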
All right. So what I tried to show here in this talk is basically our idea around video AI at RTL, our vision, but also that using Pachyderm, we can create and scale these complex artificial intelligence solutions by using simple and small, reusable modules. And Pachyderm gives us a really nice way of doing this.
Sean: Well, Vincent, thanks for that excellent presentation. And we'll get started with today's Q&A session. And I want to thank the audience for their participation. We've had a great many questions that have come in during the presentation, and we'll do our best to get through all of them in the time remaining. And during this Q&A session, I'll leave up this screen with contact information for Vincent if you'd like to contact him for a copy of the slide deck following today's webinar. All right, let's get started. So the first question is what other tools or software did the team evaluate before going with Pachyderm?
Vincent: So there are a couple of tools that we looked at, and there are also other tools that we are using for VideoPipe. So one of these is Argo Workflows. With Argo Workflows, we can also orchestrate and scale these individual units, these modules, and we can also combine them to basically solve these more complex tasks. But it is a little bit more cumbersome. And we can combine Argo Workflows with DVC, for example, for data versioning, which basically solves that reproducibility aspect. Yeah, but as I mentioned, the benefit of Pachyderm is that all of these features are tightly coupled in one product. And a really nice thing, also, I think, is that data is basically treated as a first-class citizen, where adding data to a repository, the input repository, makes sure that everything gets triggered downstream in your DAG, in your collection of pipelines. And this gives a really nice, traceable way of computing your output data. Again, these things can be solved with other tools as well, but we really like the Pachyderm approach.
Sean: Well, thanks for providing that additional context. The next question is how does this setup transfer to other domains? For example, would a similar setup work for music?
Vincent: Yes, that's a very nice question to start with. I think every complex task that can be broken down into these smaller units or these smaller modules can be solved or can be applied in a similar way that I showed here. So for music analysis, for example, we do music analysis as well in VideoPipe. So this is part of our collection of pipelines where we basically detect the amount of music that's in our content. But you can also imagine that there's another set of pipelines that digs a little bit deeper into the music and detects, for example, what kind of melodies are in there or who is singing or what kind of instruments are playing. And you can recombine this information to solve a lot of other interesting tasks. So yes, I think it transfers nicely to a lot of other domains, actually, and especially domains that deal with high dimensional, complex data, such as video or music.
Sean: Wonderful. Thank you, Vincent. The next question is do you think you would be able to replace human designers and editors at some point?
Vincent: So I think that's always a little bit of a dangerous thing to say. But I think some things that humans do can be replaced by machines. So AI is capable of doing things that humans can also do. But I think it's more interesting to think about how we can use AI to help creative people do their job more efficiently, or help them become more creative, or take their creativity to the next level using AI. I think that's more interesting, and also necessary, because especially when you are dealing with work that involves a lot of creativity, there are parts of creativity that you cannot really model with AI at the moment, or possibly could never model with AI. And you also need that human perspective, I think, to make sure that what you're creating is actually for humans and not for computers or for artificial intelligence. So you always need humans. But sure, there are also tasks that creative people are probably happy to have solved by a model. There will always be parts that can be replaced by AI, but you will also need humans.
Sean: Understood. Well Vincent, thank you. Great answers to some very good questions.