Pachyderm is a team of passionate individuals who love all things data, open source, and ML/AI. Oh, and we also love infrastructure tools and building developer communities!
Pachyderm is the data foundation for machine learning. Pachyderm provides industry-leading data versioning, pipelines, and lineage that allow data science teams to automate the machine learning lifecycle and optimize their machine learning operations (MLOps). With investment from Benchmark, Microsoft M12, and others, Pachyderm offers a user-deployed Pachyderm Enterprise Edition, a hosted SaaS Pachyderm Hub Edition, and an open-source Pachyderm Community Edition. Pachyderm helps customers get their ML and AI projects to market faster, lower data processing and storage costs, and meet strict data governance requirements through data-driven automation, petabyte scalability, and end-to-end reproducibility.
What would data analytics infrastructure (namely Hadoop) look like if we rebuilt it from scratch today? We think it would be containerized, modular, and easy enough for a single person to use while still being scalable enough for a whole company. Tools like Docker and Kubernetes provide the perfect building blocks for us to revolutionize data infrastructure!
Pachyderm is “Git for Data Science.” We offer complete version control for data and give your data science team the same first-class development tools as software developers. Pachyderm is ideal for building machine learning pipelines and ETL workflows because we trace every model and output back to the raw input datasets that produced it (a.k.a. data lineage).
Since everything in Pachyderm runs in a container, data scientists can use any languages or libraries they want (e.g., Spark, R, Python, or OpenCV) without any additional infrastructure overhead.