The bias-variance tradeoff refers to choosing a machine learning model whose complexity balances errors due to bias against errors due to variance. The ideal scenario is low bias and low variance. Since that is not always achievable, the goal is to find the model that offers the most accurate predictions on new data.
Bias is the expected difference between a model's predictions and the true values it is trying to estimate. It measures how far off, on average, the model's predictions are from the correct value or outcome you're attempting to predict.
Supervised learning algorithms have varying levels of bias. Linear regression and logistic regression are two algorithms with high bias: they oversimplify the relationship between inputs and outputs, which can result in high error rates on both training and test data. Low-bias algorithms like decision trees, K-nearest neighbors, and support vector machines are more complex and can capture more intricate patterns in the data, as in the sketch below.
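As a minimal sketch of this idea (assuming NumPy and scikit-learn are available, with synthetic data invented for illustration), a linear model fit to a nonlinear target keeps a high training error because of its bias, while a more flexible decision tree fits the same data closely:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # nonlinear target

linear = LinearRegression().fit(X, y)                 # high bias on this data
tree = DecisionTreeRegressor(max_depth=6).fit(X, y)   # low bias, more flexible

print("linear training MSE:", mean_squared_error(y, linear.predict(X)))
print("tree training MSE:  ", mean_squared_error(y, tree.predict(X)))

The exact numbers depend on the synthetic data, but the linear model's training error stays noticeably higher because no straight line can follow the sine-shaped pattern.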
Variance refers to the variability of a model's predictions for a given data point. It measures how much the model's outputs change when it is trained on different training datasets. Some change is expected when switching to another training dataset, but it should not be significant if the algorithm has identified the underlying pattern between input and output variables rather than memorizing a particular sample.
Supervised learning algorithms with high variance, like decision trees, support vector machines, and K-nearest neighbors, are sensitive to the specifics of the training data, so the mapping function they learn can shift noticeably from one training set to another. Linear regression and logistic regression are low-variance algorithms whose test-set predictions change little across different training datasets.
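A minimal sketch of variance (again assuming NumPy and scikit-learn, with synthetic data for illustration) refits a linear model and a decision tree on repeated resamples of the training data and compares how much their predictions at fixed test points spread across those fits:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_test = np.linspace(0, 6, 50).reshape(-1, 1)    # fixed evaluation points

linear_preds, tree_preds = [], []
for _ in range(30):                              # 30 different training sets
    X = rng.uniform(0, 6, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
    linear_preds.append(LinearRegression().fit(X, y).predict(X_test))
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(X_test))

# Average standard deviation of predictions across the 30 training sets
print("linear prediction spread:", np.std(linear_preds, axis=0).mean())
print("tree prediction spread:  ", np.std(tree_preds, axis=0).mean())

The unpruned tree's predictions typically vary much more from one training sample to the next than the linear model's, which is the variance side of the tradeoff.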
Underfitting occurs when a model fails to capture important regularities in the data. An underfit model has high bias and low variance: it is too simple to represent complex patterns, as often happens when linear or logistic regression is applied to a nonlinear dataset.
At the other end of the spectrum is overfitting. When a model is overly complicated or specific, it captures noise along with the genuine patterns in the training dataset, so it performs well on that data but has high variance and generalizes poorly to new data.
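Both failure modes can be seen in one minimal sketch (NumPy and scikit-learn assumed, synthetic data for illustration) that varies a decision tree's depth: a depth-1 tree underfits, with high error on both splits, while a very deep tree overfits, with near-zero training error but a larger test error.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 4, 20):
    model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"depth={depth:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

A moderate depth usually gives the best test error, which is the balance point the bias-variance tradeoff describes.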
The bias-variance tradeoff is one of the many things you’ll have to consider when evaluating your supervised learning algorithm. Reduce the challenges of building models by managing your data with Pachyderm. Sign up for a free trial to see how the platform can help scale data management.