
Human in the Loop: Building an Ethical and Optimized Stack with Pachyderm and Toloka

Jimmy Whitaker

Chief Scientist of AI @ Pachyderm

Magdalena Konkiewicz

Data Evangelist @ Toloka

Data preparation has always been a tedious and lengthy process for machine learning and artificial intelligence. As teams look to automate this part of the machine learning lifecycle, they must still handle challenges in categorizing and labeling their data. But by using a combination of crowdsourced data labeling and automation, teams can augment their ML capabilities.

In this webinar attendees will learn:

  • What the machine learning lifecycle is
  • Why it's important to integrate human oversight into ML
  • How you can combine automation and human judgment into a winning stack

Webinar Transcript

Chris: Hello, and welcome to another Pachyderm webinar. My name is Chris and I'm on the Pachyderm team. I'm excited to welcome you all to today's webinar. We have a great session lined up, and before we get started, I'd like to go over a few housekeeping items. If you have any audio or video issues while watching today's webinar, please try refreshing your window or browser. Today's webinar is being recorded, so we'll share the recording afterward. In addition, if you have any questions for the presenters, be sure to use the Q&A widget at the bottom right of your screen. Feel free to submit your questions at any time during today's presentation. If we don't get to your question today, we'll be sure to send it to the presenters and they will respond. Today's presentation is titled Human in the Loop: Building an Ethical and Optimized Stack with Pachyderm and Toloka. I am joined by Jimmy from the Pachyderm team and Magda from the Toloka team. And with that, I'll pass it over to them to begin the presentation.

Jimmy: Thanks, Chris. So yeah, just like Chris said, today we're going to be talking about data prep, and in particular, human-managed data prep with Pachyderm and Toloka. Magdalena and I will be co-presenting, so we'll be bouncing back and forth throughout the presentation. First, we're going to do some quick introductions. So my name is Jimmy and I'm a machine learning developer and developer advocate at Pachyderm, where we're trying to solve some of the hard data logistics problems as they pertain to machine learning. I actually joined Pachyderm after leading and working with applied research teams to build NLP and speech recognition models in the financial realm, in particular for tier-one financial institutions. While building those speech recognition and NLP models, we were constantly running into issues that centered on data management, iterative model improvements, and the scalability of our tooling and processes in general. So many of the things we're going to talk about today are things that I've experienced and dealt with first-hand. We're excited to chat about that. Magdalena, do you want to tell us a little bit about yourself?

Magda: Thanks, Jimmy, for this introduction, and hi, everyone. So my name is Magda and I am a data evangelist at Toloka. My background is in AI. This is what I studied at Edinburgh University, and then I worked for several years as a data scientist. I was mostly involved in NLP and speech processing projects. I have also been mentoring data science students and blogging about data science, so you can find me on Medium, where I often contribute to Towards Data Science and Towards AI. Well, I guess this is a short intro about myself, and I'm going to give the floor back to Jimmy, who will tell us the agenda for today.

Preparing Image Data for Machine Learning

Jimmy: Thanks, Magdalena. All right. So today we're going to be talking about human-in-the-loop data preparation and, in particular, how we manage that process. First on our agenda, we're going to talk about data prep and why it's hard. Then we're going to move on to human-in-the-loop and how we actually incorporate a human into the data preparation stage and put some management around that. Then we'll give an overview of both Toloka and Pachyderm. After that, we're going to talk about the integration that's been built and give a demo of it, so you can see a hands-on, real-world example of how this works in action. And finally, we're going to close out with some Q&A. As Chris said earlier, if questions come up, feel free to drop them into the chat, and we'll come to those towards the end. So with that, let's get started on why data preparation is hard.

So according to this Forbes article, which I think a lot of people have seen - it was from back in about 2016 - data scientists spend 80% of their time preparing and managing data sets. Now, this number may have changed a little bit over time, but in my experience, this is still 100% the case. And even if it's not data scientists specifically spending their time curating, managing, and preparing the data sets themselves, there are still teams in these organizations that have been hired specifically to curate data sets. Personally, from my conversations with customers and from my own experience, I'd still estimate that about 80% of the effort for a machine learning product goes into curating the data set and managing that data in some way. And the reason for this is that labeled data is a crucial component and, in some ways, a pillar of the foundation of AI.

In fact, three main things have come together over the past 5 to 10 years to enable this ML revival and cause these breakthroughs. The first is the availability of algorithms and the libraries that allow us to use those algorithms. The second is fast and scalable hardware, not only created by companies like NVIDIA and Google, but also made accessible via cloud platforms.

Curating Labeled Image Datasets

And the last is the existence of large curated data sets, for example ImageNet. That was a breakthrough not only for machine learning algorithms, but it also ushered in the big data theme and the recent renaissance there. But curating data and managing it still seems to be where most of our time is spent. And so what we ask ourselves is: why is this the case? The main reason is that data preparation is really difficult. For instance, if we look at this example right here, the prompt is really simple: draw bounding boxes around the animals. But when we give this to a person, or if we try to label it ourselves, we notice the ambiguity that comes with it. In the first picture, there is a single bounding box around all of the animals in the picture, and in some ways, that could be considered correct. In the second image, there is a bounding box around each of the animals themselves - around the full dog in the picture and the full cat in the picture - but we have nested bounding boxes as well: the box surrounding the dog also contains the box surrounding the cat. So there's kind of a weird situation here where we have nested objects.

Then in the final picture, we have non-overlapping bounding boxes, but parts of the dog are missing and the cat is separate in that specific example. The main thing we're really trying to illustrate here is that figuring out which one is correct for this problem is something even experts would disagree on. In a lot of cases, it would depend on the algorithm we're trying to train, or maybe we would remove this example; there are a lot of different situations that might impact our decision here. But we still need to know what the right solution is and how we can approach it. The real key is that we can't wait for our data to be perfect. Just like in software, no company can wait until they have the perfect application before they release it - you would never release anything in that case - you have to start somewhere. And this is where the machine learning life cycle comes in: we're not always going to have perfect data, and because of that, we need to embrace it and come up with a life cycle and a way to iterate towards a better machine learning model.

Iterating Labeled Data for ML

So the machine learning life cycle is really all about iteration, but the component that often eludes people is that there are actually two life cycles inside the machine learning life cycle, with a symbiotic relationship between them: code and data. And we're constantly iterating on both of them. For instance, we provide our human understanding of the problem to improve our code by applying new techniques and new model types, and we also provide our human understanding to our data. This means that iteration must be a fundamental part of our machine learning processes, not only in the code world, where we're actually iterating on our code, but also in the data. And the more we incorporate the right tooling, as software development has done, the more we can actually leverage our AI teams' insights and solve real-world problems in an iterative fashion. So with that, I'm going to pause there and hand it over to Magdalena to talk a little bit about the human-in-the-loop approach to machine learning.

Human in the Loop Machine Learning

Magda: Thanks, Jimmy. And hello, everyone, again. I would like to talk to you about the human-in-the-loop concept, as Jimmy has mentioned, in the context of machine learning. But first, I would like you to have a look at the graph that I have here. It is a simplified version of a machine learning production pipeline. You can see that we start with a training phase, which is followed by a validation step. Normally, we train and validate several models, and once we find one with satisfactory performance, we push it into production. Once the model is in production, we need to monitor it to catch any accuracy deviations. And if the accuracy drops, we need to retrain our model, so it cycles back to the initial training step. Often, this process is a continuous circle, executed many times during the lifespan of the product. But I would like you to ask yourself a question here: is this graph complete? I would say no. The cycle I have shown you is typical for machine learning projects that power AI products, but it is missing one important component, which is human involvement. Not many people understand that behind so-called AI products there is a lot of human effort involved, especially in the data labeling and data prepping steps.
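
To make that monitor-and-retrain cycle concrete, here is a minimal Python sketch of the monitoring step Magda describes. The helper functions (fetch_spot_check, retrain) and the accuracy threshold are hypothetical placeholders, not part of any particular library:

    def monitoring_step(model, fetch_spot_check, retrain, threshold=0.90):
        """Compare production predictions against a human-labeled sample;
        if accuracy has drifted below the threshold, cycle back to training."""
        examples, human_labels = fetch_spot_check()   # periodic human-labeled check
        predictions = [model.predict(x) for x in examples]
        accuracy = sum(p == y for p, y in zip(predictions, human_labels)) / len(human_labels)
        if accuracy < threshold:
            model = retrain(examples, human_labels)   # back to the training step
        return model, accuracy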

And I would like you to have a look at our graph again, now showing a revised pipeline. You can see that it's the same machine learning pipeline, but I have added a small human icon at every step, representing human annotation. Starting with the training step, we need humans to annotate the data to be able to perform supervised learning. Similarly, in order to evaluate our models, we need more data annotated by humans. Also, while monitoring the model in production, we would ideally need human-labeled data, at least periodically, to check whether our predictions are deviating. And if they are deviating, we need those new examples to be human-labeled in order to retrain the algorithms again. So as you can see, human annotation is required at every step of this pipeline.

And now, I would like you to have a look at another pipeline that represents an example of the human-in-the-loop concept, but perhaps an even more traditional one. On this graph, you can see that a human is used to aid an actual machine learning model with difficult cases. Many machine learning algorithms will give us probabilities attached to the predictions they make, and we can choose a threshold probability to filter out the cases the algorithm finds difficult and send them for human judgment. That way, the end user receives a prediction given by the human, and the example can be sent back to the algorithm so it can be retrained on those difficult cases and, basically, get better. So this graph also shows a rather large human effort in maintaining the algorithm's accuracy.
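
As a rough illustration of that routing logic, the sketch below filters predictions by a confidence threshold and sends hard cases to a human. The model interface and the injected helpers (ask_human, record_for_retraining) are hypothetical stand-ins for your own labeling and data-collection steps:

    def route(example, model, ask_human, record_for_retraining, threshold=0.8):
        """Return a label, deferring to a human when the model is unsure."""
        label, confidence = model.predict_with_confidence(example)  # hypothetical model API
        if confidence >= threshold:
            return label                          # confident enough: use the model's answer
        label = ask_human(example)                # hard case: route to human judgment
        record_for_retraining(example, label)     # feed the difficult case back for retraining
        return label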

And so why do we talk so much about the human effort involved in data preparation? It is because Toloka and Pachyderm can solve a lot of the problems around data labeling and data management in machine learning projects. First of all, I would like to tell you a bit more about Toloka, and Jimmy will tell you more later about Pachyderm. Toloka is a crowdsourcing data labeling platform that allows labeling of large amounts of data. In the following slides, I will tell you a bit more about what types of data can be labeled with Toloka, a bit more about crowdsourcing, and the approach we take towards data labeling. So let's get started with some examples of data that can be labeled with Toloka. Here, you can see four different types of data. The first example shows a video classification task. As you can see, the person completing the task can choose the mood of the video that he or she is watching.

The second task was used for gathering data for a named entity recognition problem, and users were asked to highlight dates, technical terms, and amounts. The third example shows a classical object recognition task, where a performer is asked to outline objects with polygons. And we have one more example that shows a side-by-side comparison of two photographs, where the performer is asked to mark whether the two images show the same object. I would like to stress that this is a non-exhaustive list of the tasks that can be labeled with Toloka. The reason I show you these examples is to demonstrate that Toloka can be used for a variety of tasks and for different domains. In principle, you should be able to perform any type of data labeling, because the tool allows you to design your own interface using building blocks and JavaScript.

Now that you have seen examples of what can be done with Toloka, the big question is why. Let me give you a simple answer here. Almost all AI products need constant data labeling. The more data you have, the better your algorithms will be. The more accurate your data is, the better your algorithms will be. And the quicker you label your data, the better your algorithms will be, because you will be able to iterate through the machine learning pipeline faster and, therefore, develop and evaluate models faster. In many machine learning projects, the labeling is done by people who are explicitly hired to perform those tasks, but this is not the best solution, because an in-house labeling process can be expensive, hard to measure, and unscalable. Imagine that you have to pay the labelers even when there is less data to label. Or, maybe the worst-case scenario, you suddenly have much more data to label, and you need to urgently hire people to keep up with the volume of data you are receiving. This is not an ideal solution, and it is why at Toloka we use crowdsourcing in the labeling process.

For those of you who do not know what crowdsourcing is, let me explain it quickly. It's just using a crowd that performs simple tasks in order to achieve some bigger goal - in our case, creating a good quality data set. This approach is more scalable, easier to measure, and easier to manage. On this slide, we have a simplified version of the interaction with the Toloka platform. Toloka itself is an open tool that can be accessed by both requesters and performers, on computers or on their mobile phones. Requesters are machine learning engineers, data scientists, researchers, and other people who may post tasks that require labeling. Once these tasks are posted, they are available to Tolokers - this is what we call our crowd - and they can perform the tasks at their convenience. They can choose their workload in terms of time and interests. A requester, on the other hand, is assured of the timely completion of the tasks they post, as there will always be someone from the crowd available to perform a particular request. And speaking of the crowd, I would like to show you a bit more about the crowd we have here at Toloka.

This slide represents Tolokers all over the world. As you can see, they speak many languages, and they are distributed across different countries and time zones. We have calculated that, on average, there are around 50,000 active users daily, ready to solve different tasks in different places around the world. The availability and diversity of the crowd allow us to treat labeling tasks as engineering tasks. With Toloka, the process can be automated, as we provide APIs with client libraries for Python and Java. This means that it can be integrated into machine learning pipelines easily, and the flow of new data can automatically trigger the labeling process if needed. This is something that we have been experimenting with together with Pachyderm, where we use Toloka as a labeling cluster and we use Pachyderm to store and version data and orchestrate the flow of the machine learning pipeline. I'm not going to tell you more about it at the moment, but at the end of this presentation, we will have a demo that demonstrates such a flow with Toloka and Pachyderm working together. For now, I will just give the floor back to Jimmy, who will tell you a bit more about Pachyderm.
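
For a flavor of what that automation can look like, here is a minimal sketch using Toloka's open-source Python client, toloka-kit (pip install toloka-kit). The pool id and the 'headline' input field are placeholders for this illustration; a real project defines them when the project and pool are created:

    import toloka.client as toloka

    client = toloka.TolokaClient('YOUR_OAUTH_TOKEN', 'SANDBOX')  # or 'PRODUCTION'

    headlines = ["Russia's nomadic reindeer herders face the future",
                 "You won't believe what happened next"]
    tasks = [toloka.Task(pool_id='12345', input_values={'headline': h})
             for h in headlines]

    client.create_tasks(tasks, allow_defaults=True)  # overlap etc. come from pool defaults
    client.open_pool('12345')                        # tasks become visible to Tolokers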

Version Control for Labeled Datasets

Jimmy: Thanks, Magdalena. So we heard a lot of amazing things about Toloka and how you can enable your teams to use crowdsourcing, as well as the actual labelers inside your organization, to efficiently curate, label, and even iterate on your data, as Magdalena was hinting at. But what is Pachyderm, and how does it fit into the picture? Let's first talk about what Pachyderm is. For anybody who doesn't know, Pachyderm is essentially a data versioning and pipelining system for machine learning. It provides a data layer and foundation to build and orchestrate your machine learning systems on. So it's kind of this bottom layer that's managing your data and also managing how data is passed between the different tools or workloads you're running. In more detail, it's built on top of Kubernetes and cloud storage to ensure that it can scale to any workload, whether that's a data storage workload or a data processing workload where you're trying to scale out pipelines in an efficient manner.

So let's dig a little bit deeper: how does it work? The two key components that enable this are a data versioning system and a pipelining system. The data versioning system efficiently manages your data, no matter what type it is - audio, image, video; pretty much any type of data that you can get from Toloka or another source fits into Pachyderm, and you can work with it. It manages this efficiently by storing things in cloud storage, and it also does some really sophisticated things like deduplicating based on portions of files, which is pretty interesting if you're dealing with very large files and artifacts inside the versioning system. The second component is the pipeline system. Pachyderm pipelines are different from other pipeline systems in that they are coupled to your changing data - that is, to the data versioning system and how your data is stored. This means that you can set pipelines to run automatically whenever your data actually changes.
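
For a sense of the versioning side, here is a minimal sketch using the python-pachyderm client (pip install python-pachyderm). The repo and file names are illustrative, and exact call names vary somewhat between client versions:

    import python_pachyderm

    client = python_pachyderm.Client()  # connects to localhost:30650 by default

    client.create_repo("clickbait_data")
    with client.commit("clickbait_data", "master") as commit:
        client.put_file_bytes(
            commit, "/headlines.tsv",
            b"Obama promises the world a renewed America\tnot_clickbait\n")
    # Each closed commit is an immutable, deduplicated snapshot of the repo.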

So for instance, if I label another 1,000 image examples, I can have my data set curation pipeline or my concatenation pipeline combine these new 1,000 examples into my released data set, or even kick off a model training process automatically whenever my input data changes. Combined, this data versioning system and pipeline system allow you to create a fully reproducible system. This means that we have a history and a lineage of every change made, not only to our data but also to the code that has been combined with that data, and we can see what downstream effects those changes have. With our data versioning and our data pipelines, we can instantly reconstruct any of the past outputs or decisions and understand which data change or pipeline change actually affected that output. If we make it a little bit simpler and look at how Pachyderm and Toloka specifically work together in practice, we see something akin to this. Toloka is amazing for editing the data and incorporating crowdsourcing into the data preparation stage. And then we can use Pachyderm not only for storing that data, but also for using the pipelines to pass our data to the next stages in our process, or to pass new data into Toloka for labeling, so our crowdsourced labelers can efficiently curate that data for us. So I'm going to pause there and then hand it back to Magdalena to talk about the demo.
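
And here is a hedged sketch of the pipeline half of that picture, again with python-pachyderm: a pipeline subscribed to the clickbait_data repo above, so that every new commit triggers the container automatically. The image name and command are placeholders, and the proto wrapper names vary by client version:

    import python_pachyderm
    from python_pachyderm.service import pps_proto

    client = python_pachyderm.Client()
    client.create_pipeline(
        pipeline_name="concatenate",
        transform=pps_proto.Transform(
            cmd=["python3", "/app/concatenate.py"],  # hypothetical script baked into the image
            image="example/concatenate:latest",      # hypothetical image
        ),
        # Re-run automatically whenever clickbait_data receives a new commit
        input=pps_proto.Input(pfs=pps_proto.PfsInput(glob="/*", repo="clickbait_data")),
    )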

Magda: Yeah. Thanks, Jimmy. So as mentioned before, we have prepared a short demo of Toloka and Pachyderm for you. I will show you a video but, first of all, I want to give you a bit more background about this project. We have created a pipeline that can be used to annotate clickbait data. Here on the screen, you can see two examples of the type of annotation that needs to be run. The first example is "Russia's nomadic reindeer herders face the future" - maybe this is clickbait. And the other one, "Obama promises the world a renewed America" - maybe not clickbait, this one. Basically, we want to annotate these types of examples, and in order to do it, we have set up pipelines within Pachyderm that orchestrate the data labeling flow. This graph shows the flow we have created for this project. I don't want to give you too many details here, and the graph may not be super visible either, but let me give you an overview of what it's showing.

Initially, we have a repo called clickbait data that holds data that can be used to train the machine learning model. We then have several pipelines that manage Toloka, so we can enrich this data set with new examples. The first pipeline creates a project in Toloka, and then we have another pipeline that creates the individual tasks - those are the tasks with the text you have just seen. We have some additional pipelines that create so-called honeypots, which later help us with quality control of the annotation within Toloka. Then we have the actual pipeline that runs the whole annotation process, which is followed by an aggregation pipeline. We need the aggregation pipeline because we give the same task in Toloka to several performers, so if one of them makes a mistake, we can catch it by aggregating their response with the others.
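
The simplest version of that aggregation is a majority vote across the overlapping responses, sketched below in plain Python. (Toloka's open-source crowd-kit library offers this and more robust aggregation methods such as Dawid-Skene.)

    from collections import Counter, defaultdict

    def majority_vote(assignments):
        """assignments: (task_id, worker_label) pairs; each task is labeled by several workers."""
        votes = defaultdict(list)
        for task_id, label in assignments:
            votes[task_id].append(label)
        # Keep the most common label per task
        return {task: Counter(labels).most_common(1)[0][0]
                for task, labels in votes.items()}

    majority_vote([("t1", "clickbait"), ("t1", "clickbait"), ("t1", "not_clickbait")])
    # -> {'t1': 'clickbait'}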

Once we run the annotation with Toloka, we have a concatenation pipeline that merges all this data together so it can then be used for our machine learning model. And what's most interesting about this process is that every time we add new data - this is something Jimmy already mentioned - the downstream pipelines are run. So the process is automated, and this is what the demo will show you. I think I can ask Chris to run the demo now, and I'll give you live commentary on what's happening. Let's wait until it buffers a bit. This is basically showing you the same graph that we've seen on the last slide; I'm just zooming in on different parts of the pipeline so you can see it better.

This is a pretty long set of pipelines, and they finish with the data concatenation, as I said before. Let me just stop it here. This is the data that we're going to add right now to a Pachyderm repo, with the command line that you're going to see on the right-hand side. We're running the command to add this data to a repo. Now have a look at what's happening. Hopefully, we will be able to see the process start here in the Pachyderm GUI, and we can see the data is flowing. Yes - right now, it's on creating pool tasks, so we can actually open a window and see what's happening in Toloka. This is the project that we have created, and this is the pool that has just been created by running the Pachyderm pipeline.

You can see that no one has completed any tasks yet in Toloka. There are 10 tasks - this is the Excel file I showed you. We can preview how the actual crowd sees those tasks: they see them in batches of five here. These are the examples we have used. I also wanted to show you the instructions - this is what the performers who complete this task see. Going back to the progress: I actually fast-forwarded this process. It takes a few seconds here, but in reality, I think it took 5 or 10 minutes. The annotation process has finished, and hopefully this will trigger the downstream pipelines within Pachyderm. Let's see if that's happening.

Yeah, we are still on the Toloka wait pool, so we are waiting. Even though we saw that the pool has finished, it hadn't progressed yet. It did just now - you can see the data flowing through the aggregation process, and now the last step, the concatenation step. You can inspect the jobs within Pachyderm here, and you can see that all those pipelines have been run. So that was the demo, and it has shown you a simple example of the integration of Toloka with Pachyderm. If you would like to set up this example pipeline, it's available on GitHub, and we will share a link to it at the end of the presentation. Also, let us know if this type of integration is of interest to you. If there is enough demand, we can work on more complex examples and maybe show you a few more projects of how Pachyderm and Toloka can work together. I know that Jimmy will share our contact details later on, so do not hesitate to contact us and give us some feedback afterwards - that will help us plan any further integration together. And I think it's time to summarize this presentation, so I will give the floor back to Jimmy, and he will give us a quick summary of what we have learned today.

Jimmy: Awesome. Thanks, Magdalena. Oops, there we go - got to get the right slide. Yes, I'm really excited by what we've just shown you. This is a really cool example of something I wish I had when I was working on speech recognition and NLP. This type of tooling and integration is really cool and incredibly useful. So, a quick summary of what we've gone through today. Essentially, what we know is that good data is the key to unlocking the potential of our machine learning projects. The difficulty is that curating high-quality data is really hard: it's an iterative process to get to a nicely curated data set that actually prepares our models for the data they're going to see in the real world. And this iterative process relies on humans being in the loop to apply our intelligence and our knowledge of the domain to data points so that our machine learning models can learn from them.

We also talked about how Toloka is a crowdsourcing and labeling platform that scales with your data and your labeling workloads, and how Pachyderm is a data versioning and pipelining platform that manages this data as it changes and moves through your machine learning training and development processes. And finally, we talked about how, together, Toloka and Pachyderm give you a scalable way to not only curate but also manage your data and bring all these moving pieces together, so that you can create more reliable and more useful models that actually address the real-world scenarios your data is coming from. So finally, we're going to go to Q&A, and we'll leave you on this slide as we're looking at questions. Make sure that you reach out to us and join our communities. Toloka, for example, organizes monthly meet-ups and webinars, as does Pachyderm, with different types of events and topics related to data science and machine learning, so make sure you plug into that.

Toloka has also provided a promo code for anybody who registers as a requester; you can use it to get $30 in your account to try out Toloka and its crowdsourcing solution. There are a few details at the top of this slide to help you navigate that. Definitely make sure to check out the integration. This is something that's pretty new, so if there are questions about it, again, reach out to us on our communities or to us directly, so that we can make sure we're building and working on things that are actually useful. And if you're interested in Pachyderm, definitely go to our website, where you can try it for free or request a demo, and we can go into more detail about how you can use it to solve your problems.

Webinar Q&A

So with that, I'm going to pause there, and I think Chris is going to maybe come back and kind of tell us what questions have come in throughout the process. I think you're muted, perhaps, if you're talking.

Chris: Maybe now, hopefully. 

Jimmy: Yes, we can hear you. Thanks.

Chris: Awesome. Always a fun thing when doing this live. So it looks like we have a few questions that came in from the audience. The first one is: do you segregate your crowdsource workers? I imagine something like, "Is this title clickbaity?" is highly subjective to the labeler and will vary based on culture, age, language, etc.

Magda: Yes. Yes, we segregate our crowd workers. You can use quality settings in Toloka and you can choose the performers according to their location, particular skills, or the languages they speak. So yes, it is possible.

Chris: Great. Great question from the audience, and definitely, keep them coming, folks. We've got plenty of time here. Next question we have from the audience: What kinds of things can you run in a Pachyderm pipeline? Can you run machine learning jobs in them?

Jimmy: Yeah. So Pachyderm pipelines are really general. In essence, everything is running on Kubernetes, so anything that you can put into a Docker image will run in a Pachyderm pipeline. I've run machine learning jobs, pulled data in, and even served endpoints from within Pachyderm pipelines. So there's kind of no limit except for what you can put inside a Docker container and what kind of pipeline you can create. That also means any language, if that's of interest. In my case with speech recognition, there were some old things written in C, and Pachyderm was one of the only ways to scale out that preprocessing without having to dig into some really interesting multiprocessing code.
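
As a rough sketch of that generality, the same create_pipeline call from earlier could just as well wrap a legacy binary baked into a custom image. The image name, command, and repo below are hypothetical placeholders, assuming the python-pachyderm client:

    import python_pachyderm
    from python_pachyderm.service import pps_proto

    client = python_pachyderm.Client()
    client.create_pipeline(
        pipeline_name="preprocess-audio",
        transform=pps_proto.Transform(
            cmd=["/usr/local/bin/preprocess"],      # e.g., an old C binary inside the image
            image="example/legacy-preprocess:1.0",  # hypothetical image
        ),
        # Each file matched by the glob becomes a separate datum, so Pachyderm can
        # spread the preprocessing across workers without custom multiprocessing code.
        input=pps_proto.Input(pfs=pps_proto.PfsInput(glob="/*", repo="raw_audio")),
    )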

Chris: Perfect. Awesome. Another question that just came in here: can you show a configuration file/files for an experiment in Pachyderm? I don't know if we can do that. I don't know if you have anything ready, Jimmy. 

Jimmy: I'll drop a link to something in the chat for that. Give me one second.

Chris: So we will share that in a bit. In the meantime, another question we have here: The demo and example you guys shared was focused on text, but can Pachyderm and Toloka work with other types of data as well?

Magda: Let me answer this from the Toloka perspective. Yes, it can work with different types of data - audio, video, and so on. There are even some special annotation tasks where we ask people to go out and complete tasks in the field, so it can basically work with a variety of data types. I'm pretty sure Jimmy will tell you about Pachyderm.

Jimmy: Yeah, absolutely. I kind of hinted at it during the presentation, but in Pachyderm, our data repositories are built at the file level, while the versioning capabilities go a lot deeper and are completely agnostic to the file type. So for instance, if you upload a bunch of video data and then change, I don't know, the last half of a video, you're not going to duplicate that whole video a second time. Pachyderm is really smart in how it versions and manages that data - it will deduplicate at the intra-file level for large files, and it does some other really cool stuff beyond that. I won't go into all the technical details; I can nerd out on that for a while. But essentially, the way we've designed Pachyderm means you can work with any type of file that comes in: anything from model files that are the output of your training pipelines, to video, image, and audio files. Even text and structured data work as well, because everything is treated as files.

Chris: Awesome, a few more questions we have in the queue. Do I need to have programming skills to run Toloka and Pachyderm?

Magda: From the Toloka perspective, for the examples we've shown you, you actually need a bit of programming skill, because all of the code was written in Python. But there is a version of Toloka that can be used by non-programmers, where you basically just upload the batch of data that needs to be labeled and wait for your results. Everything can be done through the GUI, so there are different options. Jimmy, how about Pachyderm?

Jimmy: Yeah, I think you're exactly right. It's similar at least. There's plenty of ways you can do it, and yeah, there shouldn't be any issues with that whatsoever.

Chris: Cool. Another question just came in. Can we run more complex crowdsourcing pipelines with Toloka and Pachyderm? For example, if my crowdsourcing project is too complex, I would like to decompose it into a number of different projects, where I use the results of the first project, i.e., classification, as input data for the next crowdsourcing project.

Magda: Yes, and I think this is actually where the power of Pachyderm and Toloka can be leveraged together. This is an ideal example, because Pachyderm can orchestrate the flow of the first project. Basically, we can use a pipeline to gather all the results, and then the second project could be triggered automatically using this workflow. So yes, definitely.

Jimmy: Yeah, I think this would be a really interesting case. The integration that we have now is a fairly simple, straightforward labeling process, but if you wanted to do something more complex, it should be pretty easy to extend what's already been created on the Toloka side to orchestrate a really complex workflow. This doesn't seem too complex to build, and I think it could do some really cool stuff.

Chris: Awesome. And if you're interested in learning more, definitely check out the GitHub example - a great place to actually get started with the demo right away. Looks like the last question we have, if no other questions come in, is: do I need any additional settings to run the Pachyderm and Toloka project from the demo? I guess, any other things you need for hardware or anything else you'd recommend?

Magda: There is no need for additional hardware, but on the Toloka side, you have to register first as a requester and request an authentication token, which you supply when you are setting up the project. I think it's all explained in the README file. Also, to be able to see this wonderful pipeline GUI with Pachyderm, you need the enterprise version, but all of the rest can be done with the basic Pachyderm set-up.

Jimmy: Yeah, that's right. From the Pachyderm side, because we're deployed on Kubernetes, we can deploy anywhere that Kubernetes is, regardless of which cloud you're on - pretty much everyone has that - and even if you're on-prem, you can deploy Pachyderm in that setting. Some of the UI components, in particular the console view that you saw, are an enterprise feature, but if you want to try that out, we can give you a trial enterprise key and those types of things. The core of Pachyderm is open-source if you want to run the open-source version.

Chris: Awesome. Great questions, folks. It looks like that's all the questions we have. So I just wanted to take a second here to say thanks, Jimmy and Magda, for presenting today and for a great presentation. Thank you to the audience for joining us today. If you're interested in learning more, check out the links we have, try out the demo, and check out the integration from Toloka and Pachyderm. This webinar will be available on demand - you should get the recording later today. Thanks, everyone, for joining, and we'll see you at the next webinar.