We Need to Learn How to Throw Away Data

As an experimental particle physicist working at the Large Hadron Collider, I have had to learn how to handle large amounts of data. Just as important, I have had to learn to tell what is useful from what is not, which is something most of us do every day just to get through our lives. To find the Higgs boson, we were looking for a very small signal in a huge amount of background (or noise), so the goal was to sift through the data, throw away the background, and find the signal. Sometimes you know what signal you are looking for and sometimes you don't.

In our field, advances have allowed us to take and record more data: we take a snapshot of what is in our detector almost a billion times a second. Our detector (camera) produces a 100-megapixel image; for comparison, a digital camera has 5-10 megapixels and can take about 20-30 pictures per second. We don't actually record most of the data we take. We record maybe 100 "pictures" per second and throw the rest away, hoping that what we discard is all background or otherwise uninteresting. We have to do this wisely, and we have to know how much signal we could be throwing away along with it. Even after throwing away most of our data, we still have a lot to sift through. A 2013 Wired magazine article shows how our data size stacks up in the world: our dataset is puny compared to Facebook's and others'. Aren't you glad we aren't trying to upload all of our events to Facebook for our friends to look at? Notice that the largest amount is in emails.
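To put the two rates quoted above side by side, here is a small back-of-the-envelope sketch. It takes "almost a billion" as exactly one billion and "maybe 100" as exactly 100, which are round-number assumptions for illustration, not measured values.

```python
# Round-number assumptions taken from the text, not precise detector figures.
snapshots_per_second = 1_000_000_000  # "almost a billion times a second"
recorded_per_second = 100             # "maybe 100 pictures per second"

# Fraction of snapshots that are kept, and how many are discarded per kept one.
kept_fraction = recorded_per_second / snapshots_per_second
snapshots_per_kept = snapshots_per_second // recorded_per_second

print(f"fraction kept: {kept_fraction:.0e}")
print(f"snapshots taken per one recorded: {snapshots_per_kept:,}")
```

With these round numbers, only about one snapshot in ten million is ever written down, which is why knowing what the selection throws away matters so much.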

Once the data is stored, the analysis begins. We know that some physical characteristics of the pictures we take are interesting, so we run the equivalent of facial recognition software on all the pictures to find events that look like a Higgs boson. The first thing to define is what a Higgs boson looks like. We do this with a statistical simulation that combines a theory model with a full simulation of our detector; we call the result Monte Carlo data. A Higgs boson event tends to contain higher-momentum particles, for instance, so I start by "cutting" out data that doesn't have high-momentum particles. We then keep applying further cuts, each chosen to select signal over background more efficiently. In the end, I know how many simulated signal events passed the cuts out of how many I generated (the signal efficiency). I also have models for the background events, based on real data or on Monte Carlo simulations, so I can compute the efficiency for background to pass my cuts as well.
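The idea of a cut and its efficiency can be sketched in a few lines. This is a toy illustration, not our analysis code: the function name `efficiency`, the sample shapes, and the cut values are all hypothetical, and the random numbers only stand in for real Monte Carlo samples.

```python
import random

def efficiency(momenta, min_momentum):
    """Fraction of events whose momentum value passes the cut."""
    if not momenta:
        return 0.0
    passed = sum(1 for p in momenta if p > min_momentum)
    return passed / len(momenta)

# Toy stand-ins for Monte Carlo samples (made-up shapes, arbitrary units):
# "signal" clusters at high momentum, "background" falls off steeply.
rng = random.Random(42)
signal = [rng.gauss(60.0, 15.0) for _ in range(10_000)]
background = [rng.expovariate(1.0 / 20.0) for _ in range(10_000)]

# Tightening the cut removes more background but also costs some signal.
for cut in (20.0, 40.0, 60.0):
    sig_eff = efficiency(signal, cut)
    bkg_eff = efficiency(background, cut)
    print(f"cut > {cut:5.1f}: signal eff = {sig_eff:.3f}, background eff = {bkg_eff:.3f}")
```

Each printed line shows the trade-off: a harder cut throws away more background, and the signal efficiency tells you how much of what you wanted went with it.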

Because the quantities I cut on can be correlated, we use multivariate analysis techniques such as neural networks or boosted decision trees, where we set the computer to explore the correlations and find the optimal set of variables and cuts. These machine learning techniques are now used in many fields, and figuring out how to throw away data is big business. In particle physics, the ATLAS collaboration has now paired with Kaggle for the Higgs Boson Machine Learning Challenge. In the organizers' words, "The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment."
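To see why correlations matter, here is a toy sketch with two strongly correlated variables. It uses a Fisher linear discriminant, which is a much simpler technique than the neural networks or boosted decision trees mentioned above and is swapped in only to keep the example short. All names, sample shapes, and numbers are made up for illustration.

```python
import random

def make_events(n, shift_x, shift_y, rng):
    """Toy events with two correlated variables (x, y)."""
    events = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)  # shared component creates the correlation
        x = common + 0.3 * rng.gauss(0.0, 1.0) + shift_x
        y = common + 0.3 * rng.gauss(0.0, 1.0) + shift_y
        events.append((x, y))
    return events

def mean2(events):
    n = len(events)
    return (sum(x for x, _ in events) / n, sum(y for _, y in events) / n)

def scatter2(events, mu):
    """Sums of squared and cross deviations from the mean: (sxx, sxy, syy)."""
    sxx = sum((x - mu[0]) ** 2 for x, _ in events)
    sxy = sum((x - mu[0]) * (y - mu[1]) for x, y in events)
    syy = sum((y - mu[1]) ** 2 for _, y in events)
    return sxx, sxy, syy

def fisher_weights(signal, background):
    """Weights w = S^-1 (mu_signal - mu_background) for the 2x2 case."""
    mu_s, mu_b = mean2(signal), mean2(background)
    s, b = scatter2(signal, mu_s), scatter2(background, mu_b)
    sxx, sxy, syy = s[0] + b[0], s[1] + b[1], s[2] + b[2]
    det = sxx * syy - sxy * sxy
    dx, dy = mu_s[0] - mu_b[0], mu_s[1] - mu_b[1]
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)

def background_eff_at(signal_scores, background_scores, signal_eff):
    """Background efficiency at the threshold keeping about signal_eff of signal."""
    ordered = sorted(signal_scores, reverse=True)
    threshold = ordered[int(signal_eff * len(ordered)) - 1]
    kept = sum(1 for s in background_scores if s >= threshold)
    return kept / len(background_scores)

rng = random.Random(1)
signal = make_events(5000, 0.5, -0.5, rng)      # hypothetical signal sample
background = make_events(5000, 0.0, 0.0, rng)   # hypothetical background sample

w = fisher_weights(signal, background)
x_only = background_eff_at([x for x, _ in signal],
                           [x for x, _ in background], 0.8)
combined = background_eff_at([w[0] * x + w[1] * y for x, y in signal],
                             [w[0] * x + w[1] * y for x, y in background], 0.8)
print(f"background efficiency at 80% signal, cut on x alone: {x_only:.3f}")
print(f"background efficiency at 80% signal, combined score: {combined:.3f}")
```

In this toy, neither variable separates signal from background well on its own, but the combination that accounts for their correlation does. The more powerful techniques do the same kind of thing with many variables and nonlinear boundaries.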

One of our postdocs who is looking for a job has also drawn a lot of interest in the skills he would bring to the data mining marketplace. Businesses are looking to make a profit by finding and exploiting small correlations to predict who will buy something. Security firms and the government are after better risk assessment, and so on. We are getting more and more data in our lives, and we all must become proficient at throwing away data to get to the signals we want to see in an increasingly noise-dominated world.