What Are Skew Tests?
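Skew tests, or tests of skew, measure the asymmetry of a probability distribution that would ideally be symmetrical. They are commonly used to check whether a dataset is plausibly normally distributed. The resulting figure, known as skewness, shows the degree and direction of the deviation from symmetry about the center: a positive value indicates a longer right tail, and a negative value indicates a longer left tail. The further the skewness is from zero, the more asymmetric the data and the further it departs from a normal distribution. A skewness near zero is consistent with symmetry but does not by itself prove normality.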
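As a rough illustration of how the skewness figure is obtained, the sketch below computes the moment-based (Fisher-Pearson) coefficient using only the Python standard library. The function name `sample_skewness` and the two small datasets are hypothetical, chosen only to show the sign convention. Libraries such as SciPy offer `scipy.stats.skew` and `scipy.stats.skewtest` for real work.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical example data, not taken from any real dataset.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]
left_skewed = [-v for v in right_skewed]  # mirror image of the right-skewed data

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name}: skewness = {sample_skewness(data):.3f}")
```

A value of zero corresponds to the symmetric data, a positive value to the long right tail, and a negative value to the long left tail.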
The reference case is a normal distribution, in which the mean, median, and mode of a dataset all lie at the center, creating a bell-shaped curve when graphed. When these measures of central tendency drift apart, the distribution is no longer symmetric and appears skewed to the left or right.
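The relationship between the three measures can be checked directly. The following minimal sketch uses Python's built-in `statistics` module; the helper `central_tendency` and both datasets are made up for illustration.

```python
import statistics


def central_tendency(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))


# Hypothetical data: one symmetric sample, one with a long right tail.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    mean, median, mode = central_tendency(data)
    print(f"{name}: mean={mean}, median={median}, mode={mode}")
```

In the symmetric sample all three measures coincide, while in the right-skewed sample the large values pull the mean above the median, which in turn sits above the mode.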
Why Is a Test of Skew Important?
Now that you know what a skew test is, you might be wondering why it’s important in data science.
Skewness in a data distribution can significantly affect a model’s predictive capability. For example, say you want to predict the prices of luxury homes. If your model is trained on a dataset dominated by moderately priced homes, it will likely yield inaccurate or unreliable predictions for the high-priced tail.
Because real-world datasets are rarely evenly distributed, a test of skew helps developers create better, more accurate models. Left uncorrected, skewness can violate a model’s assumptions (for example, normally distributed inputs or errors) or distort the apparent importance of a dataset feature.
Skewed data is usually corrected by transforming it. The following methods can bring it closer to a normal distribution:
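- Log Transformation: Apply the logarithmic function, log(x), to every value; the values must be positive
- Square Root Transformation: Apply the square root function, sqrt(x), to every value; the values must be non-negative
- Reciprocal Transformation: Apply the reciprocal function, 1/x, to every value; the values must be nonzero
- Box-Cox Transformation: Apply a power transformation whose exponent (lambda) is chosen to make the result as close to normal as possible; it only works for positive data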
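Which transformation is appropriate when cleaning data for modeling depends on its statistical characteristics, such as the direction and strength of the skew and whether the values include zeros or negatives.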
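To make the four transformations concrete, here is a minimal standard-library sketch that applies each one to a small right-skewed sample and reports the skewness before and after. The `values` list, the helper names, and the fixed Box-Cox exponent of 0.25 are all assumptions for illustration. In practice the exponent is estimated from the data, for example with `scipy.stats.boxcox`.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def box_cox(x, lam):
    """Box-Cox transform of one positive value for a given exponent lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)  # the lam -> 0 limit of the power form
    return (x ** lam - 1) / lam


# Hypothetical positive, right-skewed sample.
values = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

transforms = {
    "raw": lambda x: x,
    "log": math.log,
    "sqrt": math.sqrt,
    "reciprocal": lambda x: 1 / x,
    "box-cox (lambda=0.25, assumed)": lambda x: box_cox(x, 0.25),
}

# Skewness of the sample after each transformation.
results = {name: skewness([fn(v) for v in values])
           for name, fn in transforms.items()}

for name, skew in results.items():
    print(f"{name}: skewness = {skew:.3f}")
```

Comparing the printed values shows how much each transformation pulls the skewness toward zero for this sample. Note that the reciprocal reverses the ordering of the values, so it can flip the direction of the skew rather than simply shrinking it.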
Manage Data Better with Pachyderm
Real-world data is often skewed, and skew tests detect and quantify that asymmetry. You can’t leave it messy, as it will impact prediction. Pachyderm makes it easier to manage datasets for machine learning models. Try it for free today to see how it can help you get from development to deployment faster.