ChatGPT Builds NLP Excitement in the ML Space

Natural language processing and conversational intelligence are machine learning applications that keep gaining interest in areas like ecommerce, fraud detection, and healthcare.

Basic NLP already simplifies tasks like customer service. For instance, a customer can tell a retail store chatbot “I’d like to make a return,” and instead of making them navigate the website menu to find orders and returns, the chatbot can link them to the relevant page.

The excitement about ChatGPT comes from its application of two subsets of Natural Language Processing: Natural Language Understanding and Natural Language Generation. Natural Language Understanding levels up NLP by following the logical connections between the parts of a phrase, so users can speak more naturally with an AI partner and connect concepts in a single request: “I’d like to return the t-shirt in my most recent order for store credit.” The result is a conversational experience closer to the flow of natural speech.

Natural Language Generation applies that same complexity to the response sent back to a human user: responses make use of the natural flow of conversation, and can also connect related concepts. “We’ve initiated the return for your red t-shirt. Now that that’s taken care of, would you like to shop for purple t-shirts?”


ChatGPT is the latest project published by OpenAI, the team behind DALL-E 2. They’re always researching and investigating creative and thought-provoking applications of machine learning and artificial intelligence.

It was a major hit, getting attention from researchers, journalists, writers, and creatives. The conversational style of ChatGPT makes it compelling to chat with about your problems and questions, but it has some potentially massive blind spots. These make it an unreliable tool for understanding nuanced issues, or for answering questions where its information sources offer a complex variety of potential answers.

What do ChatGPT’s answers mean?

Team Pachyderm being who we are, what we find really interesting and exciting is the data used to generate ChatGPT’s answers. OpenAI has shared that ChatGPT relies on a training dataset, and their blog post includes a great overview of the response training process: educated reviewers score responses, and the model is rewarded or penalized based on the accuracy and quality of each response. The blog post also notes, however, that reviewer selection has a massive impact on the types of responses that get reinforced. One factor to keep in mind as a user is that reviewers showed a strong preference for long blocks of text over shorter answers.

The potential risk to users is that ChatGPT has been trained to write authoritative responses, and may be biased to do so even when the data available for a query does not point to a clear, correct answer. In other words, ChatGPT has a habit of blowing hot air.

We took ChatGPT for a spin and asked it some common questions that people curious about machine learning have recently used as search terms:

Question: Why do we use pipelines in machine learning?

ChatGPT Response: “Pipelines are used in machine learning to automate the process of building a model. This is important because building a model typically involves a number of steps, such as preprocessing the data, training the model, and evaluating its performance. By using a pipeline, these steps can be combined into a single, automated process that can be easily repeated and adjusted as needed. This makes it easier and more efficient to develop machine learning models, and can help ensure that they are built in a consistent and reproducible manner.”

Our Verdict: A good, simple answer: your pipelines move the code and data used by your model to create a repeatable, scalable process. Want to dive deeper into why we use pipelines in machine learning, and how data fits into that equation? Completing the Machine Learning Loop is a must-read.
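
To make the "single, automated process" idea concrete, here is a minimal sketch in plain Python. The function names (`preprocess`, `train`, `evaluate`, `run_pipeline`) and the tiny dataset are hypothetical, invented for illustration. This is not Pachyderm's pipeline format, and it evaluates on the training data only to keep the example short.

```python
from statistics import mean

def preprocess(rows):
    """Min-max scale the single feature of each (feature, label) row into [0, 1]."""
    values = [x for x, _ in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [((x - lo) / span, y) for x, y in rows]

def train(rows):
    """Fit a one-parameter toy model: the midpoint between the two class means."""
    mean0 = mean(x for x, y in rows if y == 0)
    mean1 = mean(x for x, y in rows if y == 1)
    return (mean0 + mean1) / 2

def evaluate(threshold, rows):
    """Return the fraction of rows the threshold classifies correctly."""
    correct = sum((x > threshold) == bool(y) for x, y in rows)
    return correct / len(rows)

def run_pipeline(raw_rows):
    """Chain the steps so the whole process can be rerun with one call."""
    clean = preprocess(raw_rows)
    threshold = train(clean)
    # Simplification: a real pipeline would evaluate on held-out data.
    return {"threshold": threshold, "accuracy": evaluate(threshold, clean)}

# Hypothetical toy data: (feature, label) pairs.
RAW = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
result = run_pipeline(RAW)
print(result)
```

Because every step lives behind `run_pipeline`, rerunning it on the same data gives the same result, and swapping in new data or a new `train` step means changing one place rather than repeating the steps by hand.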

Question: How do version control systems work?

ChatGPT Response: “Version control systems are used to track changes to a file or set of files over time. This allows users to revert back to previous versions of the files if needed, and to see who made changes to the files and when.

Typically, a version control system will store a history of changes made to the files, along with metadata about each change, such as the date and time it was made and the user who made the change. This history can be used to track the evolution of the files over time and to see how they have changed.

To use a version control system, users typically create a local repository on their own computer, which contains a copy of the files that are being tracked. They can then make changes to the files in the repository and commit those changes to the repository, which records the changes in the version control system. Users can also push their changes to a remote repository, which allows other users to access and collaborate on the files. This enables teams of people to work together on the same files and to easily manage changes made by multiple users.”

Our Verdict: This question doesn’t ask specifically about machine learning, but the explanation makes it clear why and how someone would implement version control for their files. For machine learning teams, version control applies to three parts of your MLOps: models, data, and pipelines (how the code and data operate together).
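
The core of ChatGPT's description (a history of snapshots, each with metadata about who changed what and when) can be sketched in a few lines of Python. `ToyRepo` and `Commit` are hypothetical names made up for this example. Real systems such as Git store changes far more efficiently and handle branching, merging, and remotes, none of which is modeled here.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Commit:
    commit_id: str
    author: str
    timestamp: str
    message: str
    files: Dict[str, str]  # full snapshot: path -> file content

class ToyRepo:
    """An in-memory, linear history of file snapshots (illustration only)."""

    def __init__(self) -> None:
        self.history: List[Commit] = []

    def commit(self, files: Dict[str, str], author: str,
               message: str, timestamp: str) -> str:
        """Record a snapshot plus metadata and return its ID."""
        parent = self.history[-1].commit_id if self.history else ""
        payload = parent + author + timestamp + message + repr(sorted(files.items()))
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.history.append(Commit(commit_id, author, timestamp, message, dict(files)))
        return commit_id

    def checkout(self, commit_id: str) -> Dict[str, str]:
        """Return the files exactly as they were at the given commit."""
        for c in self.history:
            if c.commit_id == commit_id:
                return dict(c.files)
        raise KeyError(commit_id)

    def log(self) -> List[Tuple[str, str, str, str]]:
        """List who changed what and when, oldest first."""
        return [(c.commit_id, c.author, c.timestamp, c.message) for c in self.history]

repo = ToyRepo()
first = repo.commit({"train.py": "epochs = 5"}, "ana", "initial version", "2023-01-10T09:00")
second = repo.commit({"train.py": "epochs = 20"}, "ben", "train longer", "2023-01-11T14:30")

print(repo.log())
print(repo.checkout(first))  # revert to the earlier version of the file
```

The same idea carries over to machine learning teams: the "files" being snapshotted can be model code, datasets, or pipeline definitions.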

Question: What is input in machine learning?

ChatGPT Response: In machine learning, input refers to the data that is used to train a model. This data is typically provided to the model in the form of a dataset, which consists of a collection of examples that the model can use to learn from. For example, a dataset might consist of a set of images and their corresponding labels, which the model can use to learn to recognize different objects in the images. The input data is typically processed by the model to extract features that are relevant for the task at hand, and these features are then used to train the model. The quality and relevance of the input data can have a significant impact on the performance of the trained model.

Our Verdict: In machine learning, an input is the data used by a machine learning model. Pachyderm takes the concept of an input and expands it to include the data and the pipeline code used to transform the input as it goes through the model, complete with a version-controlled global ID.
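
One way to picture an input that covers both data and code is a single ID derived from both. The sketch below is an assumption for illustration only: `fingerprint` is a hypothetical helper, and hashing is just one plausible way to build such an ID. It does not show how Pachyderm actually generates or stores its IDs.

```python
import hashlib
from typing import Dict

def fingerprint(data_files: Dict[str, bytes], pipeline_code: str) -> str:
    """Derive one short ID from the data files and the code that transforms them."""
    h = hashlib.sha256()
    for path in sorted(data_files):  # sort so file order never changes the ID
        h.update(path.encode())
        h.update(b"\0")
        h.update(data_files[path])
        h.update(b"\0")
    h.update(pipeline_code.encode())
    return h.hexdigest()[:16]

# Hypothetical inputs: two small data files and a one-line transform.
data = {"images/cat.txt": b"cat pixels", "images/dog.txt": b"dog pixels"}
code = "resize(image, 224)"

original_id = fingerprint(data, code)
new_data_id = fingerprint({**data, "images/dog.txt": b"new dog pixels"}, code)
new_code_id = fingerprint(data, "resize(image, 512)")

print(original_id, new_data_id, new_code_id)
```

With an ID like this, a change to either the data or the code shows up as a new ID, so a result can always be tied back to the exact combination that produced it.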

Question: What does data-centric mean?

ChatGPT Response: Data-centric means focused on data. In a data-centric approach, the emphasis is placed on the data itself, rather than on the processes or systems that operate on the data. This means that the data is treated as a valuable resource, and efforts are made to manage, protect, and extract value from the data. In a data-centric approach, data is often seen as the driving force behind many business decisions and processes.

Our Verdict: Preach it! 🙌 This is a great explanation of data-centric processes, like data-centric AI.

Question: What is data lineage in data governance?

ChatGPT Response: Data lineage is a concept in data governance that refers to the history of data, including where it came from, how it was transformed and used, and where it is stored. Data lineage is important because it helps to understand the origins and evolution of data, and can provide valuable information about its quality, accuracy, and reliability.

In data governance, data lineage is typically tracked and managed through the use of metadata, which is data that describes other data. This metadata can include information about the sources of the data, the transformations and calculations that were applied to it, and the systems and processes that were used to manage and store the data. By tracking and managing data lineage, organizations can gain a better understanding of their data assets, and can make more informed decisions about how to use and manage them.

Our Verdict: This answer is a pretty concise explanation of data lineage and the reasons it is most commonly used in technology. Lineage is critically important for industries like finance and healthcare, where legal compliance requires auditable data processing.
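
As a rough illustration of lineage tracked through metadata, the sketch below keeps one record per dataset (where it came from or how it was produced, and which datasets it was built from) and walks those records to answer "what went into this report?". `LineageRecord`, `CATALOG`, `trace`, and the dataset names are all hypothetical, and real governance tools record much more detail.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class LineageRecord:
    name: str
    origin: str         # source system, or the transformation that produced it
    parents: List[str]  # names of the datasets this one was derived from

# Hypothetical metadata catalog for a small reporting workflow.
CATALOG: Dict[str, LineageRecord] = {
    "raw_orders": LineageRecord("raw_orders", "export from orders database", []),
    "raw_customers": LineageRecord("raw_customers", "export from CRM", []),
    "clean_orders": LineageRecord("clean_orders", "drop duplicate rows", ["raw_orders"]),
    "order_report": LineageRecord(
        "order_report", "join and aggregate by month", ["clean_orders", "raw_customers"]
    ),
}

def trace(name: str, catalog: Dict[str, LineageRecord]) -> List[str]:
    """Return every upstream dataset of `name`, in depth-first order."""
    seen: List[str] = []

    def visit(current: str) -> None:
        for parent in catalog[current].parents:
            if parent not in seen:
                seen.append(parent)
                visit(parent)

    visit(name)
    return seen

upstream = trace("order_report", CATALOG)
for dataset in upstream:
    print(dataset, "<-", CATALOG[dataset].origin)
```

For an audit, the useful property is that every dataset in the report's history can be listed along with the step that produced it, without relying on anyone's memory of how the report was built.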

And one final question we’re all dying to know:

Question: What does it feel like to be a machine learning project?

ChatGPT Response: It is not possible for a machine learning project to have feelings, as it is a collection of algorithms and data, rather than a conscious entity. Machine learning projects are designed to analyze data and make predictions or decisions based on that data, but they do not have the ability to experience emotions or consciousness.

Our Verdict: This is the topic where ChatGPT is the expert… unless that’s just what ChatGPT wants us to think!

As conversational intelligence and natural language capabilities become more accessible to researchers, businesses, and creatives, new opportunities will emerge for AI to scale communication, customer service, and general understanding for people from all walks of life.