
Mind in the Cloud, Heart at the Edge: How GTRI is Taking the Next Step in AI and Delivery

Austin Ruth

AI Engineer @ GTRI

The GTRI Machine Learning Operations Platform, Olympia, is being developed to solve a major deficit in the Artificial Intelligence and Machine Learning community. With it, the team aims to consistently deploy, monitor, and improve models at multiple operational levels.

By utilizing Infrastructure as Code, open-source partnerships, and in-house development, Olympia offers a generalized, platform-agnostic MLOps architecture built with modularity, scalability, and deployability in mind, enabling a wide range of possible use cases.

While producing a highly performant ML model for a given task can be desirable, making this the primary focus of the ML architecture leads to pigeonholed solutions incapable of scaling to other desired use cases. By applying a modern software engineering (MSE) approach to ML development, focus can be directed toward improving the reliability, repeatability, and structure of the MLOps pipeline infrastructure rather than narrowly on producing an ML model output.

Webinar Transcript

Introduction

Chris: Hello, and welcome to another Pachyderm webinar. My name’s Chris and I’m excited to welcome you all today. We have a great session lined up. And before we get started, I’d like to go over a few housekeeping items. If you have any video or audio issues while watching today’s webinar, please try refreshing your window or browser. Today’s webinar is being recorded and we will share the recording after the webinar. In addition, if you have any questions for the presenter, be sure to use the Q&A widget at the bottom. And if we don’t get to your question today, feel free to reach out to the presenters or post your question in the Pachyderm Slack channel.

Today’s presentation is titled Mind in the Cloud, Heart at the Edge: How GTRI is Taking the Next Step in AI and Delivery. And with that, I’ll hand it over to Austin to start this thing off. So over to you, Austin.

What is GTRI?

Austin: Thanks, Chris. Hey, everyone, I’m Austin Ruth. I am a research engineer too. I’m kind of skipping ahead to my next slide. But I’m going to talk to you kind of about GTRI’s vision for the future. How that includes Pachyderm. How that includes some other tools from the AI Infrastructure Alliance and just some of the stuff that we’re working on and how we will take the next step in AI and delivery and help our sponsors but hopefully help the industry. Which is something GTRI at least hasn’t done a lot in the past. So I’ll talk through some of that too. I wanted to have a splashy title, so I thought Mind in the Cloud, Heart at the Edge, was pretty good. Chris was like, “What are you going to title this thing?” And I sat there, and it just popped into my head. And I thought, “Well, maybe that’ll get people to log on.” So I hope that at least the title was compelling.

So who am I? Like I said, my name is Austin Ruth. I am an AI/ML and modern software engineering researcher at Georgia Tech Research Institute. A little bit of my background, I started out doing system testing. So I was actually a systems engineer for about two years. And I worked primarily in subject matter expertise of separate systems and basically working with the sponsor to say, “Hey, you’re right about this. You’re wrong about this. Your data says this.” And like I said, I did that for about two years. And then, after that two-year period, I dove headfirst into the data analysis of those systems. So that primarily was get data in. Offload the data from a hard drive or whatever. Do some different parsing with the data. Do some clean-up. Do some video review, audio review. Kind of the whole data gamut. And at the very end of that, write a report that we could deliver to the sponsor and kind of tell them what the data means and all those things. I got interested in AI/ML because of my master’s degree. I got a master's in computer science from Georgia Tech about two years ago. And I did a lot of interactive intelligence things going through classes, like artificial intelligence, knowledge-based AI, so on and so forth. Machine learning for trading. So I started to take that and apply it to some of the data analysis stuff that we’ve been doing at GTRI.

Working with Pachyderm + Label Studio

Okay. So how did I get here? Why am I rambling in front of you right now? Well, let’s start at the very beginning. So I was working with Pachyderm and Label Studio, and I noticed that in my pipeline, I didn’t have a really good way to get the data from Label Studio into the pipeline without having to jump through some hoops. And when I say that, I mean I could get the data out. Jimmy Whitaker, if you’ve played with any of the Label Studio stuff, you’ve probably seen his name or chatted with him on the Slack channel. He had a tool that syncs up Pachyderm and Label Studio so that you can actually sync up to your repos. One of the problems was that, at the time, Label Studio and that syncing did not allow a user to export the data in the format that they need for their model training to go forward. So what I was having to do, to begin with, was actually export the data manually, run a put file command into Pachyderm, and then it could process, then it could train a model and output the model and keep going down and down the line. So we wrote a tool called Daffy. And what Daffy does is it hooks up to Label Studio through the API and it’s containerized so that you can set it right after your repo that’s ingesting that Label Studio data.

Now, kind of a little weird thing about Daffy is that it uses the API. So when data hits the bucket from Label Studio doing that sync, the data that’s in Label Studio is in that normal Label Studio format. And then Daffy just looks at Label Studio through the API, re-downloads the data in the export format that you want, and then it can obviously go down the pipeline. So after we did all that work, after we published that to GitHub, we wrote a little blog that kind of explained everything. Explained everything I’m talking to you about today. And kind of why we did it and why it was important for us. And that’s kind of how I ended up here. We got some folks in the Slack channel that were asking, “Well, what the heck is GTRI doing? Why are they using these tools? Why are they writing open source tools to connect some of these things? Love to hear more about it.” And so that’s how we got here. But yeah, feel free to check out that blog.
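To make that concrete, here is a minimal sketch of the kind of call a Daffy-style step makes: asking Label Studio's export API for a project's annotations in the format the downstream training step needs, then writing the result where the next pipeline stage can pick it up. The host, token, project ID, export type, and output path are illustrative placeholders, not GTRI's actual configuration, and the /pfs/out convention assumes the step runs inside a Pachyderm pipeline.

```python
import os
import requests

# Illustrative placeholders; in a real pipeline these would come from the
# container's environment or a secret, not hard-coded values.
LABEL_STUDIO_URL = os.environ.get("LABEL_STUDIO_URL", "http://label-studio:8080")
API_TOKEN = os.environ["LABEL_STUDIO_TOKEN"]
PROJECT_ID = 1
EXPORT_TYPE = "COCO"  # e.g. JSON, CSV, COCO, YOLO, depending on what training needs

def export_annotations(out_path: str) -> None:
    """Re-download a project's labeled data in the export format the
    downstream training step expects (the gap Daffy was written to fill)."""
    resp = requests.get(
        f"{LABEL_STUDIO_URL}/api/projects/{PROJECT_ID}/export",
        headers={"Authorization": f"Token {API_TOKEN}"},
        params={"exportType": EXPORT_TYPE},
        timeout=60,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

if __name__ == "__main__":
    # When run as a Pachyderm pipeline step, /pfs/out is where outputs go.
    export_annotations("/pfs/out/annotations.zip")
```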

The Problem: Manual Data Processes are Everywhere

Okay. So I started with the end of why we’re using Pachyderm, or at least how I got in front of your screen. So now I’m going to go back to the very beginning. I mentioned that in my career I did some systems testing, and we did data analysis, things like that. Obviously, it’s starting to line up. Okay, well, this is why you’re using Pachyderm. You want some automation, blah, blah, blah. Well, in my career, we had a big issue with automation in general. So to give you some stats, at least from some of the work I had in system testing, we had 15 tests. We only had two reports. We didn’t have any feedback from the user. We didn’t have any sort of continuous delivery. And by that, I mean tools or documentation. All of our processing was manual. So I mentioned before that we had little tools that processed the data. That was us literally taking the data, putting it into another file, running a Python script, and repeating that down the line. And then, of course, because of that, we had no reproducibility.

This is not just something that affected me. We’re starting to notice across our organization that people are doing some of these tests or they’re writing software to process data and they’re still manually moving data. They’re still manually invoking scripts. And they have no concept of delivery of their own tools. And, of course, they have no concept of data as a product and delivering that data in a format that their sponsor can use. And a lot of the work that we do at GTRI is actually for the DoD and we’re recognizing the same issues there. I’m sure some of the different company reps here have already played around with some of the DoD contracts and noticed that they have a pretty bad data problem when it comes to not only getting data from point A to point B but even having any form of automation or really any kind of smarts in some cases. And a lot of that’s because, hey, these are legacy systems. They’ve always worked this way. There’s a large impact if you try to go in and change it.

And hopefully that kind of sets up for what I’ll talk about later on in what we’re doing and why we’re doing it. Because we definitely don’t want to have a big impact on operations, but we want to have a large impact on the automation and the ability to do work in the future. In some cases, the system tests that we’re talking about can cost millions of dollars. And without feedback, without continuous delivery, without reproducibility, we lead ourselves into a hole of not being able to get the data that we need, and then we have to do one of these tests again. And again, that’s a cost of millions of dollars. A lot of the time, that cost falls on the DoD. And I’m sure any of the taxpayers here do not want to have us spending that kind of money when we could fix it through proper data analysis and modern software engineering.

Automating Common Data Tasks

Okay. So I’m not really great at making slides. I come from a very heavy engineering background where slides have way too much text on them and everybody’s falling asleep. So I tried to make the next couple of slides just have some impact, and I’ll talk to everything. So I’ve already explained the problem. We recognize that. So some of the first things we did, we were like, “Okay, we have to automate, right? We’re not going to change any of our processes. We’re not going to do anything other than automate.” And so we actually started with a tool called Luigi. Spotify developed it. I don’t remember when, but we started using that tool to kind of pipeline some of our data and have different processes in place that we could automate the data through. The problem there was that a lot of that was just scripting. So if we had categorical data or if we had documentation, things like that, we could make some changes there and push it down the line. But unfortunately, with things like video and audio, which in these tests were really prevalent, we couldn’t do anything, right? It was still an engineer sitting down, watching video for multiple hours, listening to the audio, and then going back to that categorical data that luckily was automated and starting to sift through that to say, “Hey, this matched up. Hey, this didn’t,” and then manually going and writing a report.
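For a rough picture of what that early Luigi stage can look like, here is a minimal sketch of two chained tasks, one parsing the raw categorical export and one summarizing it. The task names, column names, and file paths are invented for illustration; the real pipelines aren't public.

```python
import luigi
import pandas as pd

class ParseCategoricalData(luigi.Task):
    """First manual step to automate: parse the raw export into a clean CSV."""
    run_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"data/{self.run_id}/parsed.csv")

    def run(self):
        raw = pd.read_csv(f"data/{self.run_id}/raw.csv")
        raw = raw.dropna(subset=["event_type"])  # basic clean-up
        with self.output().open("w") as f:
            raw.to_csv(f, index=False)

class SummarizeEvents(luigi.Task):
    """Downstream step that depends on the parsed output."""
    run_id = luigi.Parameter()

    def requires(self):
        return ParseCategoricalData(run_id=self.run_id)

    def output(self):
        return luigi.LocalTarget(f"data/{self.run_id}/summary.csv")

    def run(self):
        parsed = pd.read_csv(self.input().path)
        summary = parsed.groupby("event_type").size().rename("count")
        with self.output().open("w") as f:
            summary.to_csv(f)

if __name__ == "__main__":
    # Run the whole chain for one hypothetical test event.
    luigi.build([SummarizeEvents(run_id="test_01")], local_scheduler=True)
```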

So in our minds, the biggest thing we could do is at least automate that process up until that point. And this is another reason that we started to get into AI/ML: we thought, “Well, hey, what if we could automate using computer vision or automatic speech recognition?” Some of these tools were pretty easily deployed through things like Keras and TensorFlow and PyTorch. And so we started to work into that. Another important aspect of what we were doing, obviously, as I said, was testing. But I don’t want you to think that that’s just a system test. GTRI is handling those system tests. It’s handling regression testing on hardware and software. It’s handling engineer testing in the form of being able to deliver a product at the end of some of these other test events. And as we have grown, as we’ve become more modern, we’ve started to learn, “We don’t even need to just test the hardware. We don’t need to just test the software. We really need to test the data processes and be able to understand what’s going on, what’s affecting our data, so on and so forth.” So we have automation. Extremely important. That’s going to speed up our stuff. Get us more deliveries out to the customer. Then we’re going to test. We’re going to set up infrastructure to test everything from the hardware and software all the way down to any generated model or any generated data, anything like that. Just depending on what we want to have working in the end, right?

And then another important thing for us is reproducibility. So in our tests, we have a lot of changes that can occur all the time. There’s so many moving parts that changes at each step in our pipelines could occur daily honestly. And with the work that we do with many different sponsors, a lot of times they come in and they say, “Hey, what if you change this? Can you reproduce everything if you change that?” And up until recently, we kind of had to say, “Yeah, we can, but it’s going to take just as much time as it did before.” And so while we designed very good experiments, and we started to have automation in the loop, it was still really difficult for us to say, “Okay. Let’s go back to this date, click on this, and reproduce everything up until that point.” And now that we’re starting to get a little more modern, now that we’re starting to do a lot of the software engineering that is focused on delivery, we’re starting to do some of this, right? And Pachyderm has actually been really helpful in that. And I’ll talk about that in a little bit.

Okay. So as I was making these slides, I was like, “Okay. I’m going to have 15 slides. These 15 slides are going to have these different dumb little tags on them.” And then I started making them and I was like, “People are going to hate this. They’re going to get tired of me going word to word.” But these are some of the other things that we think are super important for the future, right? Being able to generate data. Probably something that a lot of you are actually working on right now, right? I’m sure there’s somebody that’s using Generative Adversarial Networks or Variational Autoencoders. Or maybe even just simple simulations, maybe complex simulations - not sensor flow, sorry - like Tesla. But being able to generate data and have that in the loop is extremely important to us. The next thing that’s probably top on our list is monitoring, right? So at GTRI, most of our sponsors don’t leave once we finish a product for them. And ironically, we in the past have delivered-- GTRI’s really good at delivering, but can we provide monitoring for future AI/ML models, future agents, whatever we’re deploying, so that we can be ready to upgrade and redeploy right then and there?

I’ve just recently finished The Phoenix Project and The Unicorn Project by Gene Kim, and it was really eye-opening to see GTRI was kind of the company in the book where they do great work, they’re known for this great work, but now that we’re getting into the modern world, we need to be able to do things like I mentioned before. We need to be able to have feedback. We need to be able to continuously deliver so we can see where we fail, so on and so forth, right? And a big component of that is being able to monitor the performance of our software in the field. Kind of stumbled over this one already, but we need to be able to improve. And that kind of is, in part, all of these things on the screen, right? Generating data’s going to help us improve the model over time. Monitoring the model over time is going to help us understand what we need to generate or what data we need to fix to improve the model over time. And then deployment, right? If we deploy correctly and continuously, then we don’t have to worry that much because we can catch some of these errors, we can redeploy super quickly, and we can make sure that we have improvements in the field. And again, going back to some of our DoD stuff that we’ve done in the past, being able to deploy and improve is a big deal and it’s something that the DoD has been wanting for a long time.

The Case for Automated, Reproducible Data Loops

To give you an example, if you’re in the army and you have cameras on your JLTVs and you’re scanning for IEDs, you may have a computer vision algorithm that is looking down at the ground, and as you drive, it’s trying to identify what may or may not be known as an IED hiding spot or just an IED in general. Well, if you start to learn new things, it doesn’t matter if that model is wonderful if there are new tactics that you have to deploy against. And so one of the things that GTRI, and this is kind of bringing it all back, Mind in the Cloud, Heart at the Edge kind of thing, is we want to be able to take three stages of deployment into our ecosystem. One is we need to be able to deploy our software in the cloud so that data analysis can constantly be happening, right? We can generate new data, we can monitor the model, and we can deploy it to the cloud to continue that process of improving the model. But we need to then take that model and put it out in the field, right? We need to give it to the edge and we need to let it actually work, right? And so our question to ourselves has been, “Well, how much processing can we do at the edge? Can we deploy Pachyderm and Label Studio?” And as the soldiers in JLTVs are going down the road, could they be labeling data? Could they uncover IEDs and say, “Yeah, go back a couple of steps. Yeah, this was an IED.” Can they improve, monitor, and deploy the model right there?

And so we’re working on different ways to actually deploy Pachyderm and some of these tools at the very smallest part. And I actually have-- something like this, right? We want to have the ability to, hey, maybe we can run a little baby version of Kubernetes, K3s or MicroK8s, on a Raspberry Pi and have Pachyderm just doing some light stuff, right? Maybe all it’s doing is predictions and then it’s feeding data back to the users for them to label. Or maybe you have something a little bit beefier in the backseat where it’s a fog node and it’s doing a little bit of training to make that model a little bit better and then it’s deploying it back to the vehicle’s camera system or whatever. So that kind of gives you an idea of the full loop of at least these four things and then why the last two that I talked about are so important. If we can automate those steps - if we can test those steps automatically, and then we can do these four things - we can give the DoD, but really anybody else, the capability to have models at the edge get a lot of very rich data. And then a lot of the troubles that you have with things like labeling data, gathering data, kind of solve themselves because you’re using some of these modern techniques. Again, like Generative Adversarial Networks or active learning or something like that.

Better DataOps for Better Data Science

So I kind of talked through all of this stuff already, but I wanted to give y’all a GTRI kind of look at where we’ve gone and where we’re headed. So we started stabbing at a solution. A couple of slides ago, I talked about the problem we had. Freaking 15 tests and only 2 reports, right? And to be honest, that’s inexcusable, right? We need to have better ops so that each test has its own report and each new test can benefit from the tests in the past, right? Again, another reason that we’re very interested in Pachyderm is that we can do something like that. We can dig through the data. We have that data sitting there and we can manage it so that we can say, “Hey, this happened in the past. Okay. What did it look like? Or this happened in this test. What did it look like for these other tests? How do these line up?” So on and so forth.

So I already mentioned Luigi. We started with that. We had a lot of data analysis. We started piping data to different spots. Well, then we started to ask ourselves, “Hey, can we use machine learning?” So I can’t dig into any of the details, but we started to use just simple stuff like decision trees, right? Made it really easy. Took our categorical data. Fed it through. And we actually started to automate just a couple of the things. A couple of the analysis pieces inside Luigi with those decision trees. So we made custom scripts that fit in Luigi, and then the decision trees would get trained. And they work fairly well, is what I’ll say. So, expanding on that, we actually started to work with a tool called Airflow. I’m sure most of y’all are familiar with it. And we did basically the same thing. We used Airflow, and we used its UI to allow the customer to actually input their own data, create their own operators, and contribute to their own data analysis automation.
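As a sketch of that first ML step, here is roughly what training a decision tree on categorical test data looks like with scikit-learn. The column names and the hand-assigned target are hypothetical stand-ins, since the real analysis can't be shared.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical categorical test data; real column names are project-specific.
df = pd.read_csv("parsed_events.csv")
X = df[["subsystem", "mode", "operator_action"]]
y = df["analyst_verdict"]  # e.g. "pass" / "fail", previously assigned by hand

# One-hot encode the categorical columns, then fit a shallow decision tree.
model = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```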

And then, moving on, we started to actively use Computer Vision. And with the video that we had from these tests, we started real simple. We used convolutional neural nets to do classification. And what we did was we had screens that a video recorder was looking at and we would just identify which screens are up at that frame and how does that correlate through a timestamp with the categorical data that we have. And so that correlation gave us a ton of answers and, at least to an extent, made it where I didn’t have to sit down and actually watch that video. Everything’s not perfect. I’m not saying we had 100% accuracy and we could back it up. Realistically, we’re talking about 85% accuracy on some of these things. But we were able to at least have a little bit of confidence to say, “Hey, we’re getting most of the answers that we need. Maybe Austin only needs to skim through the video at 2x and just make sure there’s no anomalies that he would expect the computer vision algorithm couldn’t get.”
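Here is a minimal sketch of the kind of frame classifier described above: a small Keras CNN that labels which screen is visible in each frame so the prediction can be joined to the categorical data by timestamp. The input size, number of screen classes, and directory layout are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SCREENS = 6  # hypothetical number of distinct screens the recorder can show

# Frames exported from the test video, sorted into one folder per screen label.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "frames/train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "frames/val", image_size=(224, 224), batch_size=32)

model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(224, 224, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_SCREENS),  # logits, one per screen class
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=10)
# The predicted screen per frame can then be joined to the categorical data
# by timestamp downstream.
```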

Now, that did develop into more advanced stuff. So we started to do actually some genetic algorithms and building computational photography algorithms that did hyper-resolution for images. And then we could do different extraction methods, template matching, things like that. Because we had this super rich, hyper-resolved photo; we used other photos to train the genetic algorithm up to that nice resolution that we needed. And then, kind of getting off of this path of automation and Computer Vision, we started using GitLab more and more. And you may be like, “Wait, Austin, were you not doing change management or anything like that?” We were. We were, but we weren’t using proper CI/CD methods to deliver some of this stuff. So once we started using GitLab, we started to deliver containers for everything, right? So all those Computer Vision algorithms I mentioned were in containers. We had the decision tree stuff all in containers so we could just pass that input data as we needed to. And we actually did start to use GitLab as a replacement for Airflow, at least in the intermediate time, because it was pretty simple to write these pipelines inside GitLab. And then we were able to use our high-performance computers as runners to process data for us and bring it back. And so we had this elaborate system of bots, basically, that would commit to repos. And those repos reflected certain data steps so you could actually dig down into a certain repo and get the answers that you wanted. And then, kind of fast forwarding to today, what we’re using now is Pachyderm. We’re using Label Studio. We’re still using GitLab for our CI/CD to deliver the containers. But then, once those containers are delivered, we can use them in Pachyderm to process our data. And I’ll talk a little bit more about that in a little bit.

GTRI's Vision for Mature DataOps

Okay. So I wanted to just kind of talk about the vision. I know that when there was a chat in the Slack channel, we had some folks asking about the vision of GTRI. I’m going to give you more of the vision that I have and more of the vision that my team specifically has, which is all baked in modern software engineering. So we have two main focuses. DataOps. So, like I mentioned, most of the work we’re doing at GTRI always funnels down to the data. We have some folks that are focused purely on hardware. They don’t mess with the data. But the largest extent of GTRI is working with data. And so what do we need to do? Well, we need to focus on gathering data. And we’ve started to use some lean methodology, discovering and framing different feedback methodologies that we can go into a sponsor, we can set up these interviews, we can get rich data from them. We use tools like Lucidchart and, oh, I can’t think of what the other one is. Another sticky note software. And we gather all that data. We actually make a diagram for the customer. We come back with that diagram and say, “Hey, tell us where we were right. Tell us where we were wrong.” We kind of keep that loop going. So instead of just gathering data that the customer wants to give us, we start to understand the process and we understand what the data means and why they need the answers that they need.

Next thing is managing events. And I kind of have a dual purpose here. So in our team, we actually manage things like hackathons and stuff like that to educate people on some of these modern software engineering practices. But another thing that helps us do is gather more data, right? Get an understanding of what our customer base is comfortable doing, what they can do to contribute to the stuff that we’re doing. And, in general, everyone’s getting smarter now. On the DoD side, we have tons of people that are asking us to do these things and to teach them modern software engineering. And even on the industry side as well, we’re researching a lot of these things so we can help the industry say, “Hey, here’s the newest stuff.” Querying data. Oh, sorry, going back to managing events. The other thing is actually managing the events of the data. So if you have a system test, I want to see metadata. I want to see everything that went into creating the event for that test and then ensuring that that metadata makes it all the way down the line. So that when a new test comes in, I can ask, “Hey, this happened at this time. These subsystems were on. Has that happened ever before? If it has, what was the result, then?” So on and so forth.

And that goes right into the querying, right? Being able to ask those questions of our data is super important. And we can’t get there unless we are using a modern approach to storing some of this data. So right now, we’re actively looking into GraphQL and Neo4j and some of those tools, TinkerPop, stuff like that, to manage some of this stuff, but we’ve got to make sure it’s automated, right? We don’t want to have any sort of human in the loop to mess with the data until it’s time for them to do their analysis and their querying. And then finally productizing. Again, I think GTRI in the past has had an issue with this, but we’re definitely getting better. But we will start delivering the data that we create out of our customer data back to the customer so they can have a rich experience with some of these things. So it’s important to us that we not only productize our software, but the data that goes with it as well.
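To illustrate the kind of event query described a moment ago ("these subsystems were on - has that happened before, and what was the result?"), here is a sketch using the Neo4j Python driver. The graph schema with Test, Event, and Subsystem nodes and the connection details are hypothetical; they are not GTRI's actual metadata model.

```python
from neo4j import GraphDatabase

# Hypothetical connection details and schema; the real metadata graph would be
# populated automatically by the ingest pipeline, not by hand.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (e:Event)-[:OCCURRED_IN]->(t:Test),
      (e)-[:HAD_ACTIVE]->(s:Subsystem)
WHERE s.name IN $subsystems
WITH t, e, collect(s.name) AS active
WHERE size(active) = size($subsystems)
RETURN t.name AS test, e.timestamp AS time, e.result AS result
ORDER BY e.timestamp DESC
"""

def find_similar_events(subsystems):
    """Return past events where the same set of subsystems was active."""
    with driver.session() as session:
        return [record.data() for record in session.run(QUERY, subsystems=subsystems)]

if __name__ == "__main__":
    for row in find_similar_events(["radar", "datalink"]):
        print(row)
    driver.close()
```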

And then on the other side of things, we’re really focused on MLOps. And this is pretty obvious stuff, right? We need to train a model, we need to test the model, we need to deploy the model, and then we need to monitor it so we can do that whole loop again. We are extremely focused, though, on active learning and giving our operators in the field or in a control room or whatever it is, the opportunity to inject themselves into the data, right? A lot of cases-- I’m actually presenting in Philadelphia this week or this coming week, and I’ve been trying to find a great data set of rusty bridges to show some of these professional engineers. It is impossible to find enough rusty bridges to train a good computer vision model. So I’ve had to enact active learning and I’ve had to use my professional engineer buddy to actually label the data after I continually have trained it as he gives me more data. And we do this back and forth, back and forth. We’re trying to get a better model out of the small subset of data. The issue is in the DoD or in some of our industry partners, they’re getting a lot of data, they’re getting a ton of data, but they’re not actively labeling that. So it either falls on GTRI and our team to label the information, and we’re certainly not subject matter experts in a lot of the cases, or we have to provide them with an easy way to label their own data and see the benefits of that - their own monitoring of the performance and improvement of their model.
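Here is a minimal sketch of the active-learning loop in the rusty-bridge example: train on whatever is labeled, score the unlabeled pool, and hand the least-confident samples back to the subject matter expert for labeling. It uses a generic scikit-learn classifier on precomputed features purely for illustration; the real loop would sit behind a CV model and Label Studio tasks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_labeling(model, X_unlabeled, batch_size=20):
    """Pick the unlabeled samples the model is least confident about."""
    probs = model.predict_proba(X_unlabeled)
    confidence = probs.max(axis=1)               # confidence of the top class
    return np.argsort(confidence)[:batch_size]   # lowest confidence first

def active_learning_round(X_labeled, y_labeled, X_unlabeled):
    """One train/query round; the expert labels the returned indices,
    they move into the labeled set, and the loop repeats."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)
    query_idx = select_for_labeling(model, X_unlabeled)
    return model, query_idx

if __name__ == "__main__":
    # Synthetic stand-in data so the sketch runs end to end.
    rng = np.random.default_rng(0)
    X_labeled, y_labeled = rng.normal(size=(50, 16)), rng.integers(0, 2, 50)
    X_unlabeled = rng.normal(size=(500, 16))
    model, to_label = active_learning_round(X_labeled, y_labeled, X_unlabeled)
    print("send these samples to the expert:", to_label)
```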

Bringing it Together: CI/CD on the Edge

So I talked about the cloud and the edge. And our main focus at GTRI is enabling the sensor. So I already talked about something like a Raspberry Pi at the edge or something at the cloud, but we want to connect all of these. We want to be able to fly a quadcopter around, gather data directly at the sensor. And then maybe on platform, we have a very small version of Kubernetes or maybe we have a fog node that’s receiving the data and is doing a little bit of training per event. But then maybe we actually have that data again, then pushed up to the cloud, and we can do more of a deep analysis. We can do extended training. We can do generative algorithms to build more data to train, train, train, and then deliver that back to the system. But this little triangle is everything that we’re focused on. We want to be able to say, “At the sensor, we’re going to have the smartest thing deployed at all times.” And like I said, that’s what’s important to us is the sensor is our point of ingress for data. For the DoD, there’s hundreds and hundreds of sensors that are importing data all the time. Can we have a filter at each of those sensors using AI and ML to give us more rich data on the back end, but also so we don’t have to do a lot of the heavy lifting once we get data recorded or something like that?

Okay. So I’ve finally, finally kind of gotten here and I can start to talk about why we’re using Pachyderm. You’ve probably already figured it out, right? We went through Luigi. We went through Airflow. We have actually tried things like Kubeflow and a few other tools. But as of right now, we’re using Pachyderm because of the ability to do both of our goals. We can do DataOps with Pachyderm and we can do MLOps with Pachyderm. A lot of people ask me, “Well, why don’t you just continue to use Airflow?” And it’s like, “Well, with Airflow it’s kind of hard to push the data from point A to point B.” If I can just throw it into an S3 bucket equivalent with MinIO on the back end and then it goes through the pipe and does everything I want, then I’ve succeeded. That’s a really big deal to me and to my customers. Another portion of it is just the rich UI, right? A lot of people in the DoD are not going to be looking at the command prompt. They’re not going to be looking at code, and they don’t care that you have blocks and diagrams of stuff unless they can actually see and click on stuff and see the data and what’s happening to it over time.
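Since the "throw it into an S3 bucket equivalent" workflow is the crux of that point, here is a minimal sketch of what it looks like from the analyst's side using boto3 against a MinIO (or Pachyderm S3 gateway) endpoint. The endpoint, credentials, bucket, and file names are placeholders.

```python
import boto3

# Placeholder endpoint and credentials; point these at the MinIO service or
# Pachyderm's S3-compatible gateway in the actual deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Dropping a new file into the bucket is all it takes to kick off the pipeline;
# the downstream DAG handles parsing, training, and reporting.
s3.upload_file("sortie_0042.mp4", "raw-test-data", "sortie_0042.mp4")

# Analysts can list or pull results back out the same way.
for obj in s3.list_objects_v2(Bucket="raw-test-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```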

Another point is data provenance. As I mentioned before, we need to be able to query data, understand events and go back and have snapshots of those events. We actually have very concurrent data packages that we have to provide to our sponsor. And so if they come back and they say, “Hey, what happened on this event? And after you give me that answer, ship me all the data because I need to do this and this, or I need to upgrade my software so I need data from this event,” we can do that with Pachyderm and we can take that snapshot and send it back to them. And another thing is, and I will say this, I know Chris will get a chuckle out of this, but I’m definitely a Pachyderm fanboy. Love the logo. Love a lot of the stuff. But another big thing that kind of pushed me with Pachyderm is the community. To be honest, in a lot of the stuff, like Kubeflow and some of these things, it can be difficult to get onto a community of people that can help you with those problems. But with Pachyderm, I mean, honestly, even if you’re using the Community Edition or something like that, you can actually log in to their Slack channel and they’re going to help you out. I don’t know that I’ve ever waited more than an hour for an answer. Even sometimes burning the midnight oil, I’ll have somebody on there answering my questions, which is really important, especially in this modern age.

We’re wanting to do the same thing. We’re wanting to give our customers the capability to give us feedback immediately. And funny enough, it’s kind of different at GTRI to do something like that. When I talk to folks, they’re like, “Well, how did you get a hold of those Pachyderm guys?” I was like, “I messaged them on Slack and they got back to me in 20 minutes.” And they’re like, “Really? Really? You didn’t have to go through all these vendor channels?” And I’m like, “No. No. Good community. They answer the questions.” And like I said, that’s where we want to get. It started with things like creating Daffy as open source. If anyone has any questions on that right now, feel free to message me in the Pachyderm Slack. That’s probably the easiest place to talk about it. Because if I have any questions for them, we can ask the folks there as well. But Label Studio’s doing this too, right? They have a pretty good community that you can sit down with and get answers to your questions. So that’s another reason why we’re using that tool, along with the fact that Label Studio works so well with Pachyderm and that the AI Infrastructure Alliance is so tight-knit. But that’s kind of the reason. Obviously, like I said, fanboy. So we’re using Pachyderm because, as of right now, it’s been the best tool for us to deploy the infrastructure that we want to give to the user. And we’ve been testing it at different levels. We’ve got it on our cloud infrastructure. We’ve got it on what we’re calling our little fog node computer. We just deploy it through Minikube. We’ve got all the services working. So it’s a big deal for us that it has that ease of deployability. Pop a Helm chart in. Bada bing bada boom. It’s done. So that’s a big deal for us.

All right. So finally, kind of last thing I’ll talk about. Our goal at GTRI, our vision, is to get better at delivery. What is that? I don’t know. We have so many teams at GTRI working with so many different sponsors from the DoD, from industry. We’ve worked with companies like Delta. We work with the State of Georgia a lot. We work with our campus affiliate, which is the Georgia Tech Institute-- jeez, Louise, I went there, and I should know this. Anyway. Georgia Institute of Technology. Sorry, folks. But we work with them as well. And our goal, again, my team’s goal, is to make ourselves better at delivery, whether it be data, whether it be software, whether it be hardware, and ensure that we can continuously upgrade our sponsors. And give them the data that they need, give them the software that they need, and improve it over time. Get their feedback, so on and so forth. Really dig into discovery and framing, lean start-up methodology, and having a lot of the answers for those customers and being able to iterate that over time. And so I just kind of put a little handmade word cloud of some of the stuff that we’re working through. But yeah, that’s our big goal, is to deliver. And we are using DataOps and MLOps as kind of the champion for showing why delivery is important and why our other teams need to really move forward with some of these ideologies and methodologies. And that is it. If anyone has any questions? I think I saw some in the chat. Wasn’t sure. But yeah, if anyone has any questions or anything like that, let me know.

Q&A

Chris: Cool. Hey, thanks so much Austin. And thanks for really presenting here. There's a few questions here. I'm probably going to try to combine them. And I've got a few questions myself actually, so. Well, hey, I guess just start things off:

How did your team figure out how to use Pachyderm? How did you guys identify Pachyderm as a potential solution? I guess, and even when you have it now implemented, what are some of the big features that you think are important for GTRI?

Austin: Yeah. So we stumbled upon Pachyderm because, like I mentioned, we were using freaking GitLab. And we started a really small effort to build our own pipeline infrastructure. And one of our engineers, he’s a freaking LinkedIn guru. Does his due diligence like crazy. And one day he messages me and he’s like, "Have you heard of the AI Infrastructure Alliance?" And I was like, "No. What's that?" [laughter] He sends me the link. And I’m being dead serious. I think I clicked on Pachyderm because I like the logo so much. And I clicked on it, and I was like, "Have you seen this?" And he was like, "Yeah. I was just messaging these guys on their Slack channel. I think we should try this out." And so we started using it. We originally deployed it locally through all the stuff. And really the features that we liked the most were the S3 bucket ideology with MinIO as the back end. That was a big deal for us because we’re moving most everything to Kubernetes. We liked that it was built on Kubernetes, so it’s super easy to deploy. If you’ve ever used Airflow or Luigi, the maintenance on that is really hard. As long as you have an infrastructure engineer, it’s super easy to use Pachyderm. And then the other feature, and this is the thing that I’ve had to beat into my team, is the fact that it’s all based around the containers. That’s probably the biggest deal to us, because we have teams of software engineers. They need to learn how to deliver software. And a big part of that is getting it to containers so that Pachyderm can do its thing, right? I’ve got students right now, and day one I came in and I was like, "You are going to become modern software engineers. And the way I’ll know that your software works is if it runs through Pachyderm."

Chris: You mentioned your customers, the DoD. I guess walk us through a typical use case of creating, deploying, and then monitoring your models. I guess, what does that typically look like? What is the life cycle from beginning to end? And I guess even who is involved in that whole process as well?

Austin: So I’ll give you the traditional answer and then I’ll give you the answer that we’re working on literally right now. I’m going off somewhere next week to work on this. But traditionally the life cycle has been basically, sponsor comes in with an ask. A GTRI team takes it on. It could be anywhere from six months to five years, sometimes even longer. And at the very end, they deliver something. And sure, there’s monthly status reports. Normally just spending and stuff like that. But in general, it’s, "Hey, five years have passed. Here’s what you asked for." Zero feedback, any of that stuff. We actually do have projects now that are still very long-standing. But our DoD sponsors have been like, "No, we’re going to be in this process and we’re going to make sure that we’re online." And then what we’re doing now, and we’ve really kind of been hard-headed about this, is when the sponsor asks us to do something, we say, "That’s awesome. We’re going to do discovering and then framing first." So there is at least a two-week discovering and framing event that we hold with the sponsor. And we gather all their requirements, all their pain points. We have happy hours. We do all the stuff to get them super excited about their problem. And then we go back home, and the first thing we do is we set up a communications channel. So Slack or Mattermost. Things like that. And then the lifecycle is legitimately building out a software solution or building out a data solution. Delivering that capability to them. In our case, we use something like PlatformONE on the DoD side. And then, of course, we can just use an open source channel for our industry partners. We’ll deliver containers or we’ll deliver documentation, whatever it may be, and we’ll source feedback from that. And then our team will take that. They’ll do another one-week sprint, and then they’ll deliver again. The most successful we’ve been with that, I’ll actually mention, is some of the Computer Vision work that we did where our DoD sponsor wanted to be involved. So we actually worked directly on GitLab with them, and we would deliver containers so that they could actually go in and test it. And so we had this really rich communication channel back and forth of their developers actually committing and merging into our tools and then being able to say, like, "Hey, can y'all help us on this? Can you make the performance of the model better?" So on and so forth. So we’re going from a linear life cycle to a somewhat human in the loop cycle, to a full-on modern, hopefully, day-to-day cycle. 10 deploys a day kind of thing.

Chris: That’s really interesting. So you mentioned DoD is a customer. I guess, what other customers does GTRI work with? And also, how complex are their data requirements and how does it ultimately affect your DAGs and your infrastructure in terms of meeting those requirements? I know every single project is one-off; it might be different. But I guess, are there any trends or I guess anything that you think is a good takeaway there?

Austin: Definitely trends. So the one thing I forgot to mention in the slides is the eventual true product coming out of GTRI is something generic that our sponsors can deploy and then build out those DAGs themselves, right? We want to find what the commonality is, even if it’s super small. And we want to provide them with the infrastructure that includes Pachyderm, includes Label Studio, and what they need to do those things. So from the DoD side, it’s generally pretty heavy, right? So they’re event-based, more or less. And I’ll give you kind of a [inaudible] example going back to the IED infrastructure. They will hold test events and these test events will gather data. Obviously, synthetic test event data. But they’ll drive out there. They’ll have what looks like IEDs on the ground. They’ll go through all this testing. They’ll have the cameras gather data. Or they may not have cameras yet. They may just be driving down the road and guys are like pointing out, "Here." They’re writing logs down. That data is recorded. And then the data has to be either decoded or processed, however it was recorded. And then when that goes down, it obviously has to be cleaned. And then the clean data needs to be labeled. And then the labeled data needs to go into a training platform. So on and so forth. So from the DoD perspective, that’s one step: the test. And then you have operations. And so once that test is done, you need to also be able to pull in operational data and then do the same thing. And so in terms of Pachyderm, you could have probably 10 to 50 different pipeline steps that you need to successfully map out an entire test event. That doesn’t even include the operations and actually syncing data in to bring it to our platform or something like that. So hopefully that answers that question. [laughter]

Chris: No, I think it definitely does. Yeah. Always curious to see how big and complex people’s data sets are. And I think that example, the IED one, is a perfect example of just how complicated things can suddenly spin out of control. So I really appreciate that answer.

Austin: I’ll add one more quick thing from the industry side. The industry's a little different because obviously, they’re way faster. And a great example of this is, think if you had a lawn care company. You may want to start to deploy quadcopters so that all you have to do is have a dude drive up to the neighborhood and then he runs a little code. Says, "Time, date, name, and then deploy." And all the little quadcopters go out and they scan the lawns and they say, "Hey, this lawn has this many weeds. This lawn has this many weeds. This lawn has bare spots." So on and so forth. All that data's gathered centrally and then pushed up to Pachyderm. So in the industry case, because they’re doing these things so quickly, they may have fewer pipeline steps. But the good thing is that they’re continuously gathering data. And so their loop may literally only look like: data and an event come in. Parse the event. Put the data here. Put the data into the model. Model trains. Deploy the model all the way back to those quadcopters. So it is a little interesting to kind of make that distinction between the DoD, with these really large events, and the industry, where it may be super specific problem sets that have to be small. They’re going to have much smaller Pachyderm instances in that case.

Chris: Okay. Okay. Got it. That makes total sense. So we got 10 minutes left. Probably have time for maybe two or three more questions. And let me see through the chat. What are some of the benefits, I guess, that you've seen through implementing Pachyderm? You already mentioned Airflow and going through Luigi and some of the other sort of big technical evolutions. I guess, really, why Pachyderm? And what are some of the big benefits that you've seen since your implementation?

Austin: The two biggest benefits, I’ll say, are the data backed by MinIO on top of Kubernetes. So basically, being able to deploy our infrastructure with those tools through an Ansible script or something like that is a big deal because we break it a lot. And if anyone goes back through Austin Ruth’s history in the help channel, I’m breaking things a lot of the time. And most of the time, that’s me overwhelming the system that something like Pachyderm is on. But the fact that my infrastructure engineer can snap his fingers and basically redeploy everything I need, that’s a big deal. And then the MinIO back end, kind of a 0.5 on that, is when we destroy stuff, we don’t have to worry about our data going away. [laughter] Which is a really big deal, especially in Pachyderm, because for our deployments right now in Pachyderm, we've written little baby Infrastructure as Code scripts so that, once he redeploys Pachyderm, I hit a button, and my pipeline is up and running. And because I have that MinIO back end, I can just bring that data back into the forefront and I don’t have to worry too much about loss of data. And then the other thing is, and I know this is kind of a bleeding-heart sort of thing, but a big thing has been the community. So I think I mentioned we used Kubeflow. Well, we started running into some different issues with Kubeflow that we had to solve. And so it’s a student sitting there banging their head against the table, wondering what’s going on. And I remember even saying to one of my students-- he was like, "I can’t figure this out." And I was like, “See, if it was Pachyderm, I'd just send them a message in Slack." And again, that’s a big deal. But really, the other technical thing that we see as a benefit from Pachyderm is the ability to easily deploy the data. So I think I already mentioned this before, but kind of the S3 methodology of the data is really important to us because most of our customers like the ability to interact with some of that stuff. A lot of these guys are really stubborn. They want to have that interaction. So with Pachyderm, being able to use the API or the UI or whatever to add data to these buckets, but then also being able to disconnect from that and inject directly into the buckets or look into the buckets and stuff like that, is really important to us.
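As a rough illustration of what one of those "hit a button and the pipeline is back" scripts can look like, here is a sketch using the python_pachyderm client to recreate a repo and a single pipeline step after a fresh deploy. The repo name, image, and command are placeholders, and the exact client API and message names vary between Pachyderm client versions, so treat this as a shape rather than a recipe.

```python
import python_pachyderm
from python_pachyderm.service import pps_proto

def recreate_pipeline():
    """Re-apply the repo and pipeline definitions after a fresh Pachyderm deploy.
    Names, image, and command are illustrative placeholders, not GTRI's scripts."""
    client = python_pachyderm.Client()  # reads the pachd address from local config

    # Input repo that analysts (or the MinIO sync) drop raw data into.
    client.create_repo("raw-test-data")

    # One processing step; the real deployments chain many of these.
    client.create_pipeline(
        "parse-events",
        transform=pps_proto.Transform(
            cmd=["python3", "/app/parse.py"],
            image="registry.example.com/olympia/parse-events:latest",
        ),
        input=pps_proto.Input(
            pfs=pps_proto.PFSInput(glob="/*", repo="raw-test-data")
        ),
    )

if __name__ == "__main__":
    recreate_pipeline()
```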

Chris: Got it. Cool. [The last question in the chat looks like it was kind of answered through a previous statement?]. To kind of wrap things up, I guess, what's the roadmap now for GTRI? And what's the roadmap for you? What are your future goals for your projects there and what do you want to see out of Pachyderm next?

Austin: Our goal is really to take a lot of these open source tools and package them together and give a really rich experience for our customers. And kind of to answer that question, the things we want to see out of Pachyderm more than anything is aiding at least in our ability to deploy that to the customers. And I talked to some of the sales team and that’s definitely something that’s in Pachyderm's mind is not only being able to support our usage of Pachyderm but also being able to support, "Hey, guys. We got a customer that wants to use Pachyderm as a part of our platform that we’re bringing to them." And so being able to have that communication and have those sales in place is really big. From a technical standpoint, the thing I’d love to see from Pachyderm is more integration. And this may exist and I may just not be-- I may not be privy to it yet, but model deployment infrastructure would be super cool. And maybe even some built-in infrastructure of the golden Pachyderm, if that makes sense. So Label Studio. Something like ModelDB. MinIO. Being able to one-click install all of those things from Pachyderm would actually be really, really neat. And not to say that all the features need to exist, all those things, but if I could tell a student, "Hey, hit deploy on Pachyderm through the Helm chart," and magically, the MinIO persistent volume stuff is already set up for them, that’s a big win for us.

Chris: Yeah. After this recording sends, we’ll send it over to the product team and we’ll see if we can make some magic happen there. It looks like we're just about out of time, I guess. Austin, thank you so much for presenting today. I guess, any final thoughts? Anything you think anyone here in the audience should think about for next?

Austin: I will recommend a book, actually. Well, two books. I mentioned that I'd just recently finished The Phoenix Project and The Unicorn Project. And if you are on this webinar, you care about those books. I promise you. If you are even considering using Pachyderm or if you’re already using Pachyderm, you care about The Phoenix Project and The Unicorn Project because it’s going to help you with things like technical debt and understanding more about how you may not have as much agility as you think. And I know it was kind of an eye-opener for me reading those books, so I definitely recommend those. And yeah, I’m always in the freaking Pachyderm chat. If anyone wants to ask me any other questions, feel free to DM me there. At me in the general channel. I’ll definitely respond. And thanks, Chris. Thanks for having me.

Chris: Yeah, of course. Well, everyone, add that to your summer reading list. Austin's going to start a book club next week probably, or we'll do something fun like that later on. I think we're at time. But hey, just wanted to say thanks everyone for joining today. Thanks, Austin, for a wonderful presentation. Of course, everything is being recorded. So, after this, you'll be able to re-register and see the recording there. I'll also upload the video to YouTube as well to share it with the rest of the folks. But thanks, everyone. Thanks, Austin, for your time. And we'll see you for the next one.

Thanks, guys. See you.