Chief Scientist of AI @ Pachyderm
Teams looking to access and run complex ML workloads on top of their data warehouse stack must often build their own custom infrastructure. In this webinar we will explore how to create and automate multi-stage, language-agnostic data pipelines in Snowflake with complete reproducibility and provenance. We will demonstrate an ML model for churn prediction analysis as the basis for the webinar.
Join us as we cover:
While producing a highly performant ML model for a given task can be desirable, making this the primary focus of the ML architecture leads to pigeonholed solutions incapable of scaling to other desired use cases. By applying a modern software engineering (MSE) approach to ML development, focus can be directed toward improving the reliability, repeatability, and structure of the MLOps pipeline infrastructure rather than narrowly focusing on producing an ML model output.
Chris Hey, folks. Welcome. And welcome to another Pachyderm webinar. My name is Chris. I'm excited to welcome you all today to talk about everything about Snowflake and Pachyderm. We have a great session lined up. And before I get started, I'd like to go over a few housekeeping items. If you have any video or audio issues while watching today's webinar, be sure to try to refresh your window or browser. Today's webinar is being recorded, so we will share the recording after the webinar. In addition, throughout this presentation, if you have questions or anything you want to mention, be sure to use the Q&A widget at the bottom of your screen or the chat.
And if we don't get to your questions today, feel free to reach out to us, to the presenters or to our Pachyderm Slack community. Today's presentation is titled How to Run Data Pipelines for ML using Snowflake and Pachyderm. And with that, I'll pass it over to Jimmy to introduce himself and to get started.
Jimmy Great, thanks so much, Chris. Well, hello, everyone, and welcome to another webinar. Like Chris said, my name is Jimmy. I am Chief Scientist of AI at Pachyderm. And today, we're going to be talking about Pachyderm and Snowflake. So specifically, we're going to be focusing on running data pipelines to train machine learning models with Pachyderm, and that data is actually going to be coming from Snowflake. So first off, we're going to talk about the agenda. So first, quick overview-- the first thing we're going to do is do a quick overview of Snowflake, talking about what it is and why people like using it and kind of some of the details about Snowflake that make it good. Then we're going to move on to some of the cases where Snowflake kind of begins to melt and some of the issues that start to arise when working with data that exists in data warehouses. And then that's actually where Pachyderm comes in.
So we're going to start talking a little bit from there about what Pachyderm is, what it does, a little bit how it works. And then finally, we're going to move to exploring a working example of Snowflake, doing a churn analysis for a music streaming service. So with that, let's go ahead and dive in.
Jimmy So let's first talk about Snowflake. So Snowflake hit the scene in about 2014, right around there. And it's been recognized kind of across the industry as a game changer. But why? So there have been databases before. There have been storage mechanisms before. But what makes Snowflake, in particular, so different? And in a nutshell, what that is, what actually has allowed Snowflake to succeed is that it's brought cloud computing to data warehousing. And it has made that a mainstream sort of thing. So really, what this means is what they've tried to do or what they've-- the intention and the approach has been for Snowflake to unlock the value of the data that you have in your data warehouse and actually do this by making this a cloud-based approach so that you can actually build your data applications on top and kind of leverage the value of your data.
Because its main objective is to unlock the value of your data, Snowflake does this by breaking down data silos and helping you build applications on top of your data. And it does this kind of with three main features. So the first feature is that Snowflake is cloud agnostic. So Snowflake is based in the cloud, and it doesn't specifically matter which main cloud provider you have, whether it's Amazon or Azure or GCP. You don't really have to worry. You can deploy it in any of those places. So the good thing is this means you don't have to worry about upfront costs. So you can actually try things out at a fairly low cost. You can increase performance when you need to because the system can kind of grow the way cloud computing does. You can start with something small, and you can scale it up, or you can figure out how you want to grow and scale things up as you go.
Jimmy The second thing here is performance and speed. And this is where Snowflake has kind of really carved out a differentiator. So Snowflake itself kind of was built for performance, and especially for analytic-type queries. And because it's cloud-based, it can enjoy the kind of elastic nature of the cloud, which is really great. So if you want to load data faster or run higher volume queries, you can scale up your virtual warehouse and take advantage of kind of these extra compute resources. And so because of this, it's-- or one of the things that they've built to this end, rather, is a SQL query engine that allows developers to take advantage of the elasticity and scalability that can be offered in the cloud environment. And this can allow you to be cost-effective.
So if you're running a lot of queries that cost a lot of money, what you can do is you can limit the amount of computation you throw at it, and that can make things more cost-efficient. So that's this third point here. So like all cloud services, you can have less administration overhead because you don't need to manage the hardware.
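For reference, that compute/cost trade-off is controlled directly in SQL: you size a virtual warehouse up or down and let it suspend when idle. A minimal sketch (the warehouse name is hypothetical):

```sql
-- Create a small warehouse that suspends itself when idle (name is illustrative)
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;

-- Temporarily scale up for a heavy analytic query, then scale back down
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ... run the expensive query ...
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';
```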
But the real value of Snowflake comes in in that it separates out that storage and compute. So that means if you have that really expensive query that you need to run on your data, you can actually control how much it costs by trading off that speed at which it runs with the cost you want to actually-- of the hardware that you actually want to throw at it. So with the goal of unlocking the value of your data, this is a really, really good start. This is why people really like Snowflake and why it's being chosen kind of across the board for a lot of things or at least data warehouses in general. But with anything, any tool that's built specifically for some certain use cases, there can be some downsides. And so there are a few things that Snowflake doesn't do very well. For example, the first thing is that it's kind of structured only for structured data and not necessarily for unstructured data processing. So Snowflake itself doesn't really support unstructured data, except through what's called scoped URLs. Typically, this means you're referencing an object sort of outside of your data warehouse and the only way to really process this unstructured data using Snowflake is kind of with-- if anybody's familiar with Java UDFs or something like that, you can actually process data using that.
Jimmy But there's a lot of problems with this because data scientists typically don't have access to Java-- or sorry, don't use Java consistently. They may have access to Java. And writing UDFs in general can be kind of unmanageable. They're not super easy to write, not very straightforward. And this unstructured data in particular or the lack of support for unstructured data-- also, if anybody doesn't know, unstructured data's kind of like audio, video, images, text, so chat or email or anything like that.
These kinds of things can be pretty difficult for Snowflake to manage because it has to treat them as references to objects outside of its structured format. And working with unstructured data, unfortunately, is really the business of machine learning. So for instance, if I'm training a model, not only is my model a particular type of file that's unstructured data, but it's also going to consume unstructured data files as well. So for instance, my training data might be in a particular format, like a CSV or Parquet, and those can come from structured data sources. But ultimately, they're files at the end of the day, and working with files is very much the mechanism for machine learning and how we typically interact with machine learning.
Jimmy The second thing here is programming language support. So Snowflake itself supports pipelines through Snowpipe and has a couple of other things that you can do there, but really the limitations are kind of centered around SQL, query-only operations. For instance, in order to process data in a pipeline with something other than SQL like a programming language like Python or R, what you need to do is you actually need to get the data out of the warehouse, typically, and into an environment and a format where a programming language like Python can actually manipulate it.
There have been a few tools that have been created here. For instance, there's Python libraries and other types of things to export that data to, say, a CSV file or Parquet file. But ultimately, this means that you're actually moving data out of the data warehouse so that you can, then, work with it in a different system. Specifically, you're also having to write SQL code to perform kind of-- or if you're operating inside the data warehouse, you're having to write SQL code to perform kind of this iterative exploratory data analysis and data science, which can be really good. But then once you actually want to productionize that or run that consistently, then these experiments can get really lengthy, hard to read, and even harder to reproduce if you're doing things outside of your data warehouse in a non-scalable, not repeatable way.
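For example, pulling a query result out of Snowflake into a CSV that Python can then work with typically looks something like this (a minimal sketch using snowflake-connector-python and pandas; the connection parameters, table name, and query are placeholders):

```python
import snowflake.connector  # pip install "snowflake-connector-python[pandas]"

# Connection parameters below are placeholders for illustration
conn = snowflake.connector.connect(
    user="USER", password="PASSWORD", account="ACCOUNT",
    warehouse="COMPUTE_WH", database="CHURN_DB", schema="PUBLIC",
)

# The query runs inside Snowflake; the result is pulled into a pandas DataFrame
cur = conn.cursor()
cur.execute("SELECT * FROM user_logs LIMIT 100000")
df = cur.fetch_pandas_all()

# At this point the data has left the warehouse and lives in a local file
df.to_csv("user_logs.csv", index=False)
```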
Jimmy The last thing that we have here is version control. So Snowflake does have something that it calls Time Travel. And this allows you to access historical data at any point within a defined period. But the biggest drawback here is that the retention period itself caps off at 90 days, and after that, it's placed in what's called a fail-safe state, which is kind of a seven-day period. So during this, you can actually have historical data recovered by Snowflake, but the documentation says that the fail-safe is kind of a best-effort basis and it's only intended for use when all other recovery options have been attempted.
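For reference, Time Travel is exposed through SQL clauses like these (a sketch; the table name and offset are illustrative, and the retention window depends on your edition and settings):

```sql
-- Query a table as it looked 24 hours ago
SELECT * FROM members AT(OFFSET => -60*60*24);

-- Query a table as it looked just before a specific statement ran
SELECT * FROM members BEFORE(STATEMENT => '<query-id>');

-- Restore an accidentally dropped table, while it is still within the retention period
UNDROP TABLE members;
```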
So in general, what this means, this version control aspect, is there are a couple of things, like a couple of niceties to snapshot the status of your data inside of Snowflake, but ultimately, it's not kind of a data-centric view of the world. Version control isn't a core component. And this isn't necessarily a problem with Snowflake. In fact, a lot of these things aren't necessarily problems. They're just limitations of the architecture and the approach. But some of these things are hugely useful features because, frankly, storing data is a hard problem, and even programming language support and supporting unstructured data, these are all very hard problems. But when you start considering the things like MLOps and, for instance, your retention period isn't enough or something like that, you need to know exactly what datasets were used. You probably need to apply programming languages. And maybe, you're combining that with some unstructured data as well or turning that structured data into unstructured data that can be used to make the decisions that you wanted to.
And so while all these things aren't necessarily problems with Snowflake, just limitations of the architecture, combining these limitations with something that does fill these gaps can actually make a really nice kind of symbiotic relationship and make things a lot better. And of course, that's how we actually get to Pachyderm and how Pachyderm fills those gaps.
Jimmy So with all that said, how does Pachyderm actually fill in some of these gaps? So Pachyderm itself - we'll talk a little bit about what it is - is a platform to deal with kind of the hard problems. In particular, it's built for any data, including unstructured data, and it's built to work with that data from any programming language. And also, it can autoscale to any size. So some of these features we've seen in the structured data land of Snowflake, but in the unstructured data world, these features actually are pretty complementary to Snowflake in some really nice ways.
So in summary, this slide, really Pachyderm is focused on kind of this any data, any language, any scale, and that's how Pachyderm is trying to fill these gaps or in particular, is a nice component to fill some of these gaps in Snowflake. So let's next talk about-- if anybody is not familiar with Pachyderm, we're going to talk a little bit about how we actually do this and how it connects to Snowflake. So at a high level, Pachyderm itself combines a scalable data versioning system with data-driven pipelines. And this actually is really nice because the way that we combine these things is it gives you a full reproducible lineage for any of the data operations or data and pipeline operations that happened inside of your pipeline, your directed acyclic graph environment inside of Pachyderm. And this itself is very useful inside of machine learning context. But the platform itself is also general enough to work for almost any data engineer's needs, whether you're just curating a dataset and trying to release that, whether you're combining some unstructured data with structured data and wanting to release that, or wanting to automate some processes that are going to be useful. So on the data versioning side, we kind of have a few major benefits.
Jimmy So the first one is that the Pachyderm data versioning system is built on top of object storage and has a git-like structure. So the main thing is any commits-- so for instance, we have a-- we'll see this in action in a little bit. We have data repositories that have the git-like structure, which means that when we put data into that data repository, it's treated as commits. And so this actually enables us to not only identify when new files were added or when files were changed but also be able to go back in history if we want to revert our data repository to an older state. So for instance, if we're putting a CSV file in there and the CSV file is changing consistently, then we may want to go back to a previous version of that. Secondly-- and we'll see why in a second, but the data versioning platform is also built on top of object storage, and so this means we can actually scale up to pretty much a limitless size because object storage is based on that. And then furthermore, it's file-based.
So instead of actually relying on rows and columns inside of a database, we actually treat things as files as our core object inside of the Pachyderm file system. So that means if you're pulling in data, you can be pulling in files. If you're pulling in models to actually do something with a model and apply a model, you can do that. Or if you're working with images, audio, video, whatever else, then you can actually pull those things in as well and just treat them as kind of the native objects that they come in, and it's your code's responsibility to handle those objects as they come in. So on the pipeline front-- so this is actually the way that Pachyderm allows you to apply this any-language approach to your data.
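Before moving on to the pipelines, here is roughly what that commit-based, file-oriented versioning looks like from the command line (a sketch using pachctl; the repo and file names are hypothetical):

```bash
# Create a versioned data repository
pachctl create repo training_data

# Each put is recorded as a commit on a branch, git-style
pachctl put file training_data@master:/train.csv -f ./train.csv

# Inspect the commit history and the files in the latest commit
pachctl list commit training_data@master
pachctl list file training_data@master
```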
So first off, Pachyderm is based on Kubernetes, and this Kubernetes approach means that we can containerize our pipelines or containerize our code so that these can be deployed and actually connected to-- how we connect them is connect them to our data repositories so that anytime our data changes, then Kubernetes knows we can apply the docker container.
Jimmy So basically, you don't have to worry about having everyone on your team use the exact same programming language. If your data engineers use Scala and Java, that's totally fine. If your data scientists use Python, that's totally fine as well because your team isn't constrained to having to work with a particular language or even a particular framework. Because it's language- and framework-agnostic - Pachyderm relies on docker containers - this enables engineers to choose the best framework and tool for a specific, say, stage in a pipeline or a specific task that needs to be automated.
The other thing is Pachyderm itself also approaches pipelines in a particular way, where they're data-driven. You don't actually have to manually kick off a process, because our pipelines are consistent and sequential. And whenever Pachyderm recognizes the data's changed, it will actually notify the pipelines that, "Hey, my data versions have changed, and we need to re-run some processing." So this means that, for instance, if we're changing our SQL-- or sorry, if we're changing data in our database, maybe that data can be pulled in automatically, and any time that data changes, our pipelines are automated and can automatically kick off all the processes that are connected to it.
So this means, each time new data is detected, Pachyderm can deploy a docker container containing our code, process that piece of data, and then write that data to an output, or even put that somewhere else, or put it back into your data warehouse or something like that.
Jimmy The last thing that we have is the distributed nature. And so we kind of talked about this with regards to Snowflake, where you get to take advantage of all of the cloud compute's capabilities of auto-scaling to jobs or providing kind of a limited amount of resources so that your jobs don't actually take a whole lot of time to run or don't take up a lot of money in the process of actually trying to execute. And because we distribute our computation through Kubernetes, this means that we can also parallelize workloads and distribute large compute jobs across an entire cluster of resources as well. And not only this, but we can do this across files. And this kind of stitches everything together so that we can have a reproducible lineage of everything that's happening. There's also some niceties with incremental processing, so because we know the difference between datasets, we can only process what's new if that's a need that you have. So there's some really nice things that you get when you combine a version file system with a pipelining system that's data-aware.
And the last thing that Pachyderm does - and I'll be pretty brief on this - is that you also get a data lineage. So this is kind of an immutable record for all the different things that have happened inside of your processing. So for instance, in the machine learning context, everyone really needs to know, with confidence really, that a model or workflow result can be traced back to its original source. So for instance, if we're-- like we'll do in a second. If we're training a churn model on our data and trying to predict, okay, who's our most likely candidates for churning in this next quarter, and how do we approach solving the problem of churn, or how do we actually keep them, retain those customers? Then what we would need to do, really, is we probably want to update this pretty consistently. We may want to look at the new data that's come in. And so anytime we produce a model, we really want to track not only the versions of the code that produce the result but also the data. And so this helps us maintain compliance and reproducibility and even manage the relationships between our datasets so we can provide a complete account of every step in the journey. And this last component here is kind of how we do that.
Jimmy So global IDs, in some systems, you have to chase a bunch of commit hashes and these types of things. But if you have a model output or a dataset, you can actually look up confidently and see, with a single identifier, everything that happened, everything that was processed, or everything that was connected to that version of the dataset. So some really nice ways to kind of search and work through and figure out where your code or your data changed to affect the outcome.
And finally-- and I maybe should have started with this slide, but this is kind of where Pachyderm fits into the overall architecture. So Pachyderm sits at that layer kind of above your cloud resources, and it's tightly coupled with Kubernetes and object storage. So our data management system is built on top of this object storage, and our Pachyderm pipeline system is built on top of Kubernetes, and these two kind of work together. And so when you put these two elements together, you actually get kind of this foundation for data inside of the machine learning or data engineering paradigm, especially when you need version control over your data or when you need to apply some type of processing or reproducible processing to this. So this gives you a flexibility to build all kinds of pipelines using any language, and it's a really general framework for really doing a lot of different things. I've used it for a lot of things in my past, in particular on speech recognition and wrangling data from transcripts and aligning that to audio files and lots of other things that we've done in there. So if you want to be able to transform your data from, say, a pre-processed dataset or whatever else and then allow that to be collaborated on by data scientists and then produce kind of an end artifact or prediction on your data or model or something like that, then Pachyderm's a really nice thing to allow you to have this kind of full lineage of everything that's happened inside of your system.
Jimmy 00:19:55.373 And even if something breaks, you can go back and fix it because you know exactly what data changed or what code changed to affect your overall outcome. So now we're going to move more towards the-- we've talked a lot about the abstraction and kind of the philosophy behind some of these things.
But now let's actually look at a concrete example. And the concrete example, we've kind of alluded to throughout this presentation, is going to be customer churn. So customer churn or predicting churn in general, it's a task to determine whether a user of your system or of a subscription will actually renew their subscription or not to that particular service. So for instance, if we imagine a music streaming service like, say, Spotify, then we could say that the context for churn would be, what's the likelihood that an individual user will renew their subscription, say, like the next month or something like that? And churn prediction is determining who's at risk of not renewing. So for example, it can help tremendously when we're trying to figure out what promotions to run or how to retain customers on that front or how to better engage with the user base or what features are successful or are not successful, many other things. But really we're trying to determine who is at risk of churning. And then we can dig deeper into figuring out what's the cause or what action should be taken. Well, really the whole purpose of determining or predicting customer churn is really to get ahead of the problem. So you're trying to be proactive and prevent it from happening. And when we're trying to determine this cause-- or maybe not the cause of the churn. But when we're trying to predict churn or who is likely to churn, it's a pretty difficult task because we typically have to have a lot of data and a lot of users. And we are trying to usually apply machine learning to figure out what features are most indicative of that user churning. And then we apply the result of a machine learning model on those features to predict churn itself and then start approaching the problem by looking at the root causes for those things.
Jimmy 00:22:02.051 The good thing is, for instance, in the case of our music streaming service, we typically have a lot of information about the user. So how often do they use the platform? Is their subscription an auto-renewed subscription? How is their usage trending over time, or what's the price of competing services? A lot of these things can be really useful features for actually figuring out how a customer is doing or what a customer is doing and if they're likely to churn. And not only that, these features are probably going to change over time. And so we'll start to see pretty quickly why data versioning is crucial, not only in just producing a model off of known data but as our data changes, as our features change, as our columns and our data changes over time. We probably want to know what actually impacted that model and how we can reproduce that in a consistent and reliable way. So specifically, we're going to kind of lean in on this music service churn analysis problem with Pachyderm and Snowflake. So in this example, we're going to use-- it's a KKBox Churn Prediction Challenge. And so this is from a Kaggle dataset or Kaggle challenge, and it shows a real-world setup for predicting churn. So this was a real music service, I believe out of the Netherlands, but I could be wrong there. And what we're going to do here is we're actually going to have all the data that we would typically expect to be in our data warehouse. So in this case, the members, the user logs, and the transactions tables are going to be inside of Snowflake. And then we also end up creating a train database. So this is going to give us a historical record of whether these users actually did churn. Or true or false, these users churned by this date. And so we can actually use that data collectively to pre-process it and actually do some feature engineering and train a model and then apply that model to new data that's coming in. So this is what we're calling, in the upper right-hand corner, our prediction data.
Jimmy 00:24:02.165 So this is going to be kind of new data that's coming in for, say, the past month. And then we can actually predict on whether or not there's a good chance that user is going to churn. And then this churn data is going to go back into our Snowflake cluster because maybe we built an application to know, "Okay, once I have the predictions for all of our customers that may or may not churn, we'd probably want to look at what are some of the root causes or what things we can maybe target." Maybe target a price cut for them or an incentive, some type of incentive-based thing to get them to continue their service. And so in this demo, we're going to be using Pachyderm to kind of assess the impact of churn or to actually train a churn model and address kind of the impact and the model performance. And then we're actually going to create kind of a multi-step language-agnostic pipeline to do all of this. So first, what we're going to end up doing is we're going to ingest data from Snowflake to Pachyderm. And we'll walk through all of this inside of a notebook. We're going to use Pachyderm to build these language-agnostic pipelines that perform these pretty complex data operations. And then we're going to make some predictions on the data that we want to infer on-- or with the data that we want to do inference on, sorry. And then once we make those predictions, we're going to push those back to Snowflake. And so these are kind of the key things that we're going to see in this demo. And I'll switch over to that now if I can figure out how to share my screen. All right. There we go. [silence]
Jimmy 00:25:50.249 All right. So hopefully, people can see my screen. Let me know if that's not the case. But what we have here is the-- so this full example, it's on GitHub. So if you want to walk through it, feel free. We'll share the slides as well as a link to this at the end if you don't want to go to this right now. But we're just going to be walking through, essentially, this Snowflake Python notebook. And so all the details are here if you want to go into more detail and understand everything that's going on and see everything. So because this is real-world data, I'm actually going to be doing most of-- I'm actually going to be doing most of kind of my analysis or most of my explanation by walking through this notebook. But because it's real-world data, we're dealing with quite a bit of data. So this processing can take a while, which doesn't usually demo well. So we're kind of going to walk through a deployed-- I ran this yesterday, but a deployed version of the full Snowflake or the Snowflake example, the churn analysis example inside of Pachyderm. And we'll just kind of walk through and talk through some of the components. So the first thing I'm going to do is I'm going to talk about what things I have up here. So this is the Pachyderm console. So this shows me everything that I have deployed. And data repositories - so where I'm storing data and making commits - are going to be represented by these kind of purple cubes. And then our data pipelines are going to be these sort of kind of bluish, greenish spheres. And so as we're kind of walking through this, we'll be able to see data that exists in these and then what type of processing we've done inside of these boxes here. The other thing I'm going to have here is most people are familiar with Jupyter Notebooks. I'm going to be walking through this Jupyter Notebook. Make that a little bit bigger. And I have my example cloned here. So I'm using everything that is available to you guys inside of that GitHub repository.
Jimmy 00:27:47.855 But the other cool thing that we have here-- and this is just more of a shot of-- you'll see me use it a couple times, but this is our mount extension, or Pachyderm mount extension. And I'll be able to mount these data repositories that exist that we'll see over here and kind of explore the data inside of them. So right now, I have my training data mounted, and we'll look at that here in a second. So first, I want to start off-- I'm just going to briefly go over the setup of what we have here. Essentially, you'll need a Snowflake cluster, and what you're going to end up doing is you're going to kind of want to be able to establish a connection with that cluster, and there's a few details here about this setup and ensuring that your credentials are correct and everything. And then what we actually do is we're going to set up these databases inside of Snowflake. Now, I've already gone through this process of creating-- there's a few helper functions here, but essentially, creating these members table, the train table, the transactions table, and users log table. So if I go over to Snowflake, I can see that those four tables exist as well as this predictions table. And that's where my data is going to end up landing from the predictions of my model. And I can even look at the preview of my data. So this training table has-- this is, essentially, a user identifier and the results of did they churn or not. So if we scroll far enough, we'll probably see some ones and some zeros. This is, essentially, my labeled dataset that I'm going to use to train my model. And then we have some other things. And like we said, we have quite a large amount of data in here, so this is about 392 million rows. And this is all the user logs. So on certain dates, how many total seconds did they listen to of a certain song? What's kind of the top few songs that they listened to, or what's the count of the top 25 or top 50 songs that they listened to? And those kinds of things. So we have a lot of data about our users here. And now we want to actually make use of it.
Jimmy 00:29:44.551 So once we have our tables created in Snowflake, we're then going to-- this is actually where the Snowflake Pachyderm integration comes in. So what this integration is going to do is it's going to execute a SQL query, and then it's going to create a CSV file for the result of that SQL query. And then finally, it's going to take that result, and it's going to commit it to Pachyderm inside of a versioned data repository. So this data warehouse integration, it's going to let us also do this on a regular cadence. In particular, how we have this set up right now is we're going to have it run every 24 hours, but we can think-- and we can also manually kick off jobs as well. But this kind of real-world scenario where we want to automate this or maybe run this kind of same-setup analysis over a period of time can be really beneficial and really useful. So this is a-- I'll skip over a couple things here, and it'll make sense in a second. So the first thing that we're going to do is we're going to build the SQL query. So what I'm doing here is I'm actually kind of doing an inner join across all these tables. Basically, I want all my features. I'm going to do some feature analysis and everything. But I just kind of want to concatenate all those tables on the user ID so that I have all that data to work with. So I'm doing an inner join across the members, so I have all the information about my members. I have all the information about their transactions. And this is a fairly expensive query, and I have all the training data, the labels for that data, as well as the user logs for all of these folks. So this is how I'm going to actually create my training dataset, and then we're going to do some feature analysis and stuff on there. And then secondly, we're going to do a similar type of thing for our inference data or the data that we want to predict on. And the only difference here is that we have our training labels in the case of our training data, but we don't have any training labels. We just have the raw data for our inference data.
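Conceptually, the training-data query is just a join of those tables on the user ID, along these lines (a simplified sketch; the column names follow the KKBox dataset, but the real query in the example repo selects and aggregates many more fields):

```sql
-- Simplified sketch of the training-data query: join everything on the user id
SELECT t.msno,
       t.is_churn,              -- label: did this user churn?
       m.city,
       m.registered_via,
       tr.payment_plan_days,
       tr.is_auto_renew,
       ul.total_secs
FROM   train        t
JOIN   members      m  ON m.msno  = t.msno
JOIN   transactions tr ON tr.msno = t.msno
JOIN   user_logs    ul ON ul.msno = t.msno;
```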
Jimmy 00:31:40.688 And so what the integration actually does-- so our Pachyderm integration with Snowflake really just creates two pipelines. And we use something called jsonnet pipeline specifications to do this. So if you want some more information, we can talk about that, or I can answer some questions in the Q&A about that one if people are interested. But the first data-- or the first pipeline is going to-- this is going to be our train data imports, our training data. Yeah, Snowflake import. But the first pipeline is going to be a cron tick. So this is just going to take every-- or basically create a commit inside this data repository every time something-- or on that cadence. I think we have 24 hours right now. So every 24 hours, there will be a new cron tick here, and that will actually - because Pachyderm's data driven - will automatically kick off everything downstream with this. And so this is actually going to write out our query to our training data query's pipeline. So actually, I can just go over here and show you what that query looks like. For instance, if I go to my training data queries, I can, then, go look at that and see it wrote this one file, and this is kind of the full SQL query that's actually going to produce my dataset. So basically, the output of this query's file is my SQL query. That's all that pipeline is doing, is getting a cron tick and then outputting my queries to this training data, queries data repository. So we have a version history of what query was actually run to produce our dataset. And then the second pipeline is going to actually run that query and then write a CSV output. In this case, it's going to be our training data that's going to get written out. So here, if I go back to here, I can see I have my training data. And I don't want to open the full file, but if I do this-- whoops, look at the head of that file, then I can see that I have kind of the user ID and then looks like some date information and a variety of other features here.
Jimmy 00:33:44.066 So that's kind of how the overall integration-- or the ingestion part of the integration works, is that we have this kind of cron tick that happens, and then we're executing this query and writing a CSV file as the output. So we've basically turned our structured data from Snowflake into a CSV file. And so now we can start working with it from our Python code or R code or whatever else we want to write. Also, it's probably worth noting, these pipelines themselves have actually been written in Go. And so you don't even really need to know about that, because these are kind of pipeline templates that we're able to use. So all we actually need to do to execute it or when we created it is we're going to create this pipeline, and we pass it, our specification. This is a jsonnet specification. And then we just pass it some arguments. So this is going to be the name of our data ingestion or our data repository where we're going to ingest our data to. We need a URL, so this is going to be kind of the location of where our data is. We're also going to have our query itself, so this is quite a long line because I'm doing a pretty expensive inner join and being very verbose about what things should map to what. And then we have our cron spec, so we're going to have it run every 24 hours. And then finally, so we don't have to put our password in plain sight, we have kind of the Snowflake secret. So what we can do is we can create a Kubernetes secret that completely hides that from everyone else but allows our pipelines to be able to use the secret when they're running so that we can actually access our data.
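From the command line, creating an ingestion pipeline from that jsonnet template looks roughly like the following (a sketch: the argument names come from the Snowflake import template in the example repo and may differ slightly, and the URL, query, and secret name are placeholders):

```bash
# Create the Snowflake import pipeline from the jsonnet template
# (argument names follow the example's template; all values here are placeholders)
pachctl update pipeline --jsonnet snowflake_import.jsonnet \
  --arg name=training_data \
  --arg url="snowflake://USER@ACCOUNT/CHURN_DB/PUBLIC?warehouse=COMPUTE_WH" \
  --arg query="SELECT ... FROM train JOIN members ON ..." \
  --arg cronSpec="@every 24h" \
  --arg secretName=snowflake-secret
```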
Jimmy 00:35:11.548 So next, this is actually if I want to manually kick this off, I can just manually tell it to do a cron tick. This will just make it commit to that data repository, the cron data repository up here, and just kick off that data so we don't have to actually wait the full amount of time. And then we're actually going to do pretty much the exact same thing. So we're running this exact same jsonnet pipeline template. And we're going to provide it just with a few different things, our prediction data, and then we have a different query that's happening in here. But everything else pretty much runs the same. So if I go over to my console, I can see that yeah, I have these data repositories. I can see the last time that there was a commit to it, so these are the cron ticks on both of these. And then I can also look at the pipelines and see what's actually running in here. So I can see that this was the SQL query, and this is the command that's running the image that actually was being used to run this transformation. So this is the actual docker image. So if I ever wanted to reproduce anything that's happening, this is the entry point with that docker container. So we have kind of a fully reproducible environment, and this is how the data lineage builds up inside of Pachyderm.
Jimmy 00:36:24.019 So next, we can see we have these data queries repository. I can view the files and see that this is my query file. I don't think I can preview this file in here right now with the console. But we just looked at that file over here inside of the mounted data repository. And then finally, we have all our training data in here. And so what we're going to do is we're now going to create a clean data-- or what we've done is we've created a clean data pipeline step. So what this means or what's actually going on in here is I have a docker image that I built, and then we have a Python entry point. And so usually, what my workflow looks like in this is I would get to this point. I would ingest the data or maybe get all of my data or the data that I think I want to actually work with into my training data repository. And then I would actually go over to my notebook environment, and I would probably mount that data in. And I would start maybe working with a notebook or doing some type of data science processing and everything. Once I get to a state where I know the processes and the steps that I want to actually run to clean my data, then I'm probably going to create usually just a Python file that I can wrap up in a Docker container that's going to, then, run my data cleaning stuff. So here, I can see I'm dropping some rows with missing data, I'm replacing some values with another value or have some functions to do that, if there's some issues or kind of impute a column value, and then this may not be the cleanest Python code here, so apologies there. But essentially, this is going to be-- I'm going to have a couple of things where basically, I'm reading my data from the CSV file, and I'm writing some data to an output CSV file.
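A cleaning step like that can be a very small script. A minimal sketch of what it might look like (file paths, column names, and imputation choices here are illustrative, not the exact code from the repo):

```python
import pandas as pd

# Inputs are mounted by Pachyderm under /pfs/<input-repo-name>;
# the repo and file names here are illustrative
df = pd.read_csv("/pfs/training_data/training_data.csv")

# Drop rows that are missing the user id or the label
df = df.dropna(subset=["msno", "is_churn"])

# Impute a numeric column and normalize a messy categorical value
df["total_secs"] = df["total_secs"].fillna(df["total_secs"].median())
df["gender"] = df["gender"].fillna("unknown")

# Anything written to /pfs/out is versioned in the pipeline's output repo
df.to_csv("/pfs/out/clean_data.csv", index=False)
```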
Jimmy 00:38:03.910 So when that actually goes into Pachyderm, this clean data process is going to just run this Python file. So I can test that Python file wherever, know that it's doing my data cleaning. I can deploy this, and then this is actually going to write the output data to a special place in Pachyderm called /pfs/out, and anything that's written into this file location inside of Pachyderm will get written to an output data repository with the same name. So I can see on the output, I have the CSV file. This might be a big one, so I may not want to preview it, but this output CSV file that actually has all my training data from a model that's cleaned. And so similarly, we have a feature engineering pipeline. So this is going to take that clean data. It's going to, then, engineer some features that will allow me to, then, run another pipeline that's going to model my data. And so this is going to actually output my actual trained logistic regression model in this case, and this logistic regression model will be used to actually predict on my data.
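The modeling step that follows can be just as small. A sketch of a logistic-regression training step under the same conventions (the feature names and file paths are illustrative):

```python
import pickle
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read the engineered features produced by the upstream pipeline (illustrative path)
df = pd.read_csv("/pfs/feature_engineering/features.csv")

X = df.drop(columns=["msno", "is_churn"])  # features
y = df["is_churn"]                         # label

# Train a simple churn classifier
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# The serialized model is just another file, versioned like any other output
with open("/pfs/out/model.pkl", "wb") as f:
    pickle.dump(model, f)
```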
Jimmy 00:39:10.914 So if I go back to my notebook, let's look at how some of these pipelines actually kind of look and get created. So we knew that these pipelines, we have a specification, this Snowflake import, jsonnet file. That's our specification. That allows us to import data, and Pachyderm has kind of already built that for you. But how do these other pipelines work? So really all a Pachyderm pipeline is is a JSON or YAML specification, where we're just giving it some data. So basically, what's happening here is we have this clean data. We just saw our clean data pipeline, a description if we would like it there. So we're going to clean and curate our dataset. And then this is actually where we define how these pipelines should be organized and connected. So here, I'm saying my input is this trained data repository, and there's a special thing here called a glob pattern. We won't go into too many details here about this, but essentially, this means that all of the data inside of training data, whenever it changes, should be kind of mapped into my docker container when it runs. And so when this docker container runs, we can tell it, "Here is the image that needs to run, and then here is the Python command that will run the Python file that's inside of my docker container." And then I just use command line arguments to specify where things are coming from. So here, because Pachyderm is file-based, I'm going to say that my data is going to be in /pfs. So /pfs is Pachyderm file system. It's a special place inside of Pachyderm where our inputs and output are mounted. So this trained data repository gets mapped to /pfs/traineddata.
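Put together, a clean-data pipeline spec along the lines just described looks roughly like this (a sketch; the image name, glob pattern, and file names are illustrative rather than the exact spec in the repo):

```json
{
  "pipeline": { "name": "clean_data" },
  "description": "Clean and curate the training dataset",
  "input": {
    "pfs": {
      "repo": "training_data",
      "glob": "/"
    }
  },
  "transform": {
    "image": "myorg/churn-clean:latest",
    "cmd": [
      "python", "clean_data.py",
      "--input", "/pfs/training_data/training_data.csv",
      "--output", "/pfs/out/clean_data.csv"
    ]
  }
}
```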
Jimmy 00:40:46.787 And I mentioned the name of the file here that I'm looking for. And then my output is going to be written, like we just talked about a minute ago, to /pfs/out, and this will get committed to an output data repository whenever this pipeline has finished running. So basically, anything that's written into /pfs/out will be versioned inside of the Pachyderm file system. And you know that this was my original data. This is how I process the data-- sorry, this is how I process the data, and then this is where the output was going to go. And so you have that full lineage of everything that happened. So once we have all of those pipeline specifications-- here, I'm just executing all these. I'm creating a pipeline and passing the file for my data cleaning pipeline and my feature engineering pipeline, even my model pipeline. And feel free to click around inside of the Pachyderm data repository-- or sorry, the Git repository for all these things to see kind of how these things look. For instance, if I open this, I can see that yeah, there's my clean data pipelines back there. So the next thing that we want to do is we want to-- this is kind of where we're moving from. We've done a lot of things. We've actually trained our model, in this case. Now we want to predict churn for users. So how our Snowflake integration works is that-- we also have this other component called Egress. And so Egress is going to take the data that we predicted-- we haven't deployed our prediction pipeline yet. At least, we haven't talked about it yet. But ultimately, how it works is that we're going to have a Snowflake table that we're going to put our results or we're going to inject our results into or we're going to insert the rows of our predictions into. So this is me just using the Python SQL kind of command line here-- SQL library here. Oh, sorry, the Snowflake SQL library here to execute the SQL where I'm going to create the table if it doesn't exist. I've already done this, and the table is named predictions. And then here, I'm going to create all the pipelines that are needed for my inference pipeline.
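Creating that table from the notebook with the Snowflake connector looks roughly like this (a sketch; the connection parameters and column types are placeholders). With the table in place, the inference pipelines can be created the same way as the training ones:

```python
import snowflake.connector

# Connection parameters are placeholders for illustration
conn = snowflake.connector.connect(
    user="USER", password="PASSWORD", account="ACCOUNT",
    warehouse="COMPUTE_WH", database="CHURN_DB", schema="PUBLIC",
)

# Create the table the predictions will be written back into (columns are illustrative)
conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS predictions (
        msno     VARCHAR,
        is_churn FLOAT
    )
""")
```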
Jimmy 00:42:52.726 So this is going to be my clean data for my predictions, my feature engineering for my predictions, and then finally, my actual prediction pipeline. So that's going to be following my prediction data. Here's that pipeline here. And it's pretty much the exact same code, just with a different file that we're pointing at. Same here for this feature engineering pipeline. And then finally, my prediction pipeline on the output. And so I can see here, I have my Python predict.py file, grabbing my model. So here, I have two inputs that are coming in. I have my feature engineering predictions as well as my model data repository. So both of these inputs are coming together. And then once I pull those together, I can output my predictions to the output data repository. And we can look at that CSV file-- oh, sorry, folder and then CSV file inside of here. So the other component of this predictions pipeline is the Egress that we talked about. So similar to our ingestion phase, our Egress is actually in the pipeline spec. And what it does is it has a URL. Basically, we're saying we're going to Egress anything that we wrote to this output data repository-- we want to push that back into Snowflake. So here, we're going to give it pretty much the exact same URL that we had previously. Oh, sorry. It is the exact same URL we had previously because we want this to go into our churn example project-- or yeah, database. And then we want to actually use the same compute warehouse and some other things, some Snowflake-specific things. And then we just tell it we want it to be a CSV. And then we also give it the names of the columns here just to be verbose. And we need that secret again. And so here's that secret that we're just going to pass around to our pipeline, and it kind of gets handled in a secure manner.
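In the prediction pipeline's spec, that egress section looks roughly like the following fragment (a heavily hedged sketch: the field names follow Pachyderm's SQL-database egress support and may differ between versions, and the URL, columns, and secret are placeholders):

```json
{
  "egress": {
    "sqlDatabase": {
      "url": "snowflake://USER@ACCOUNT/CHURN_DB/PUBLIC?warehouse=COMPUTE_WH",
      "fileFormat": { "type": "CSV", "columns": ["msno", "is_churn"] },
      "secret": { "name": "snowflake-secret", "key": "PACHYDERM_SQL_PASSWORD" }
    }
  }
}
```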
Jimmy 00:44:47.940 And we also have a few other things here. These are just a couple of things that I needed for the cluster that I was using. But basically, I was working with quite a bit of data. Because it's Kubernetes, I can actually request particular resources. Here, for example, I'm saying I need 32 gigabytes of memory and eight CPUs at least to run this pipeline Egress because it's quite a lot of data. And then finally, once that pipeline completes-- I can see that this prediction pipeline has a previous job that was created five hours ago and ran successfully. And I can actually go back over to my predictions table over here and see that I have 171,000 rows that I've predicted on that have just been injected into Snowflake from the predictions job that I had inside of Pachyderm. And so yeah, so that pretty much sums it up. I know there are a lot of gory details in there, but I tend to be-- I want you to fully kind of understand all the different moving pieces and everything. So hopefully, that was helpful and not too boring during the process. So yeah, to just kind of summarize, so we started with data that existed inside of Snowflake, these tables, we ingested those into Pachyderm using the Pachyderm jsonnet pipeline specification, so the Pachyderm Snowflake integration. Once we had that data inside of Pachyderm, all that data is getting versioned, say, in our case, every 24 hours. Anytime that data is re-ingested, Pachyderm will have a new version of that data and will automatically kick off all the downstream pipelines. So for instance, in this case-- and we won't wait until this fully runs, but if I run a cron tick on my data queries pipeline, we'll see that it automatically kicks off the full query process. So it may take a little while to actually run this query and everything because it's quite a bit of data and some inner joins. But I didn't do anything except create a new cron tick up here, and that automatically told this pipeline to run. And then when this data repository output changes, this training data will run or will basically rerun that query and ingest that data. Once this has finished, my clean data will know, "Oh, okay, there's some new data I need to process." And then all the downstream pipelines will run. And finally, it will inject my predictions back into Snowflake once that's complete.
Jimmy 00:47:14.255 So yes, we've built this pipeline. And once it runs once, this full thing is automated so that anytime any of our data changes-- or even if our code changes, if we update one of these pipelines, then Pachyderm is kind of data aware as well as code aware. So as soon as our docker container changes or our entry point changes, then it will know that these other jobs are out of date and they need to rerun, so. So yeah, so basically, you get these great things that we talked about earlier, this data versioning, this kind of data-- sorry, language-agnostic processing, and even some scalability. We won't go into the scalability conversation in this example, but there are some ways that you can split out your data and do some parallel processing on that if you need to do that. So with that, I'm going to go back to our slides. Ah, yes, we just went through the demo. And that pretty much brings us to the conclusion of this webinar, so everything that we showed you today is on GitHub for you to try out yourself, so. Or better yet, if you want to understand more, book a time for a demo so you can go into more detail about the integration if you have some questions. I know I went through a lot of details, so I'm sure there are lots of questions about it. But feel free to reach out to us. We can do a tailored demo for you and have a technical discussion there. And with that, I think this brings us pretty much to the end of our webinar. We'll share out the slides and resources. But for now, I think we're going to open it up for some Q&A.
Chris 00:49:00.144 Cool. Thanks, Jimmy. And hey, folks, if you have questions, we've got some time here, looks like the last 15 minutes. So feel free to use the Q&A, just post your question there. I've also linked the blog with the announcement for the Snowflake release in the chat, so feel free to check that out as well.
Jimmy 00:50:36.869 Thanks, everybody. Take care.