The data's too big

You're consulting for an online retailer that has handed you an enormous spreadsheet containing every conceivable measurement of each of their 700,000 customers. Each line in the file corresponds to one customer and contains 600 quantitative features.

Overwhelmed by the sheer volume of information, you decide to do what seems like some modest data trimming by rejecting outliers. You make a histogram of each of the 600 quantitative measurements and label as an outlier any value that falls outside the central 99% of its distribution. You then go through the list and eliminate every customer for whom at least one measurement is an outlier.

How many customers do you expect to have left to analyze?


  • For simplicity, approximate the histogram of each of the 600 quantitative measurements as a uniform distribution between its minimum and its maximum value.
  • Assume that a customer being an outlier in one measurement is independent of their being an outlier in any other.
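
One way to check your answer is to note that, under the two assumptions above, a customer survives the trimming only if all 600 measurements land inside the central 99%, which happens with probability 0.99^600. The following is a minimal sketch of that arithmetic; the function name `expected_survivors` and its parameter names are hypothetical, chosen here for illustration.

```python
def expected_survivors(n_customers: int, n_features: int, keep_fraction: float) -> float:
    """Expected number of customers with no outlier in any feature.

    Assumes each feature independently falls inside the kept range
    with probability `keep_fraction` (0.99 for the central 99%).
    """
    # Probability that a single customer passes every one of the feature checks.
    p_survive = keep_fraction ** n_features
    # By linearity of expectation, scale by the number of customers.
    return n_customers * p_survive


remaining = expected_survivors(n_customers=700_000, n_features=600, keep_fraction=0.99)
print(f"Survival probability per customer: {0.99 ** 600:.5f}")
print(f"Expected customers remaining: {remaining:.0f}")
```

Under these assumptions, the "modest" trimming leaves only a fraction of a percent of the original customers.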
