The data's too big
Overwhelmed by the sheer magnitude of information, you decide to do what seems like some modest data trimming by rejecting outliers. You make a histogram of each of the 600 quantitative measurements and label as an outlier any value that falls outside the inner 99% of its distribution. You then go through the list and eliminate every customer for whom at least one measurement is an outlier.
How many customers do you expect to have left to analyze?
- For simplicity, approximate the histogram of each of the 600 quantitative measurements as a uniform distribution between its minimum and its maximum value.
- Assume there is no correlation between a customer being an outlier in one measurement and being an outlier in another.
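
One way to check the arithmetic is the short sketch below. It relies only on the two simplifying assumptions above: each measurement independently keeps a customer with probability 0.99, so a customer survives all 600 cuts with probability 0.99^600. The function names and the customer count of 1,000,000 are hypothetical, chosen for illustration only; the problem does not state how many customers there are.

```python
def surviving_fraction(n_measurements: int = 600,
                       inner_fraction: float = 0.99) -> float:
    """Probability that a customer is not an outlier in any measurement.

    Assumes outlier status is independent across measurements, so the
    per-measurement survival probabilities multiply.
    """
    return inner_fraction ** n_measurements


def expected_survivors(n_customers: int,
                       n_measurements: int = 600,
                       inner_fraction: float = 0.99) -> float:
    """Expected number of customers left after trimming."""
    return n_customers * surviving_fraction(n_measurements, inner_fraction)


if __name__ == "__main__":
    frac = surviving_fraction()
    print(f"Fraction of customers kept: {frac:.4%}")

    # Hypothetical customer count, not taken from the problem statement.
    n_customers = 1_000_000
    print(f"Expected customers left: {expected_survivors(n_customers):.0f}")
```

Under these assumptions the surviving fraction is roughly 0.24%, so the "modest" trimming discards almost every customer. With correlated measurements the number would differ, which is why the independence assumption is stated explicitly.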