Did the FBI use software to scan 650,000 emails, word for word, in 8 days? originally appeared on Quora: the knowledge sharing network where compelling questions are answered by people with unique insights.
There is a lot of misinformation going around about this, so I will try to give an informative answer here.
Trump said the following at one of his rallies yesterday, basically adding fuel to the fiery narrative of the election being rigged:
"You can't review 650,000 new emails in 8 days. You can't do it, folks."
As Obama would've said, yes we can. Sorry to break it to you, Mr. Trump, but you definitely can review 650,000 emails in 8 days. I have personally done more than twice that number in less than one-fourth the time. How do I know? Because I do this for a living. The bottom line is that you don't need to review all 650,000 emails. Heck, you don't even want to review such a huge population; that would be utterly stupid, not to mention a huge waste of time and money. Instead, you weed out the emails that are not responsive to what you are looking for. There are a number of steps the FBI would have taken to cull down the review population, involving various forensic and investigatory tools. Now, I don't know the exact workflow the FBI followed, but I can give you one potential way to reduce the 650,000-email population to a manageable review population:
- De-duplication: First, we ingest the 650,000 emails into a processing tool like Nuix. These tools, especially Nuix (it is blazing fast), extract metadata (e.g., Email From/To/CC/BCC, Email Subject, Date Sent, etc.) and text from all emails. More importantly, these tools identify exact duplicates via a process called hashing. The software basically compares the hash value (a unique ID) of all 650,000 emails and removes the duplicates. We can further reduce the population by comparing the new email dump against the old 30,000-email dump the FBI had obtained in the earlier Clinton email investigation. The software will automatically ignore the emails common to the two sets, thereby reducing the population even further.
- Email Filters: After de-duplication, we can run a search on the remaining population to keep only the emails that have Hillary's email address(es) in the Email From/To/CC/BCC fields. This removes all emails where Hillary is not a participant, as those emails would not be relevant to the investigation. As the emails collected belong to Anthony Weiner, it is safe to assume that this would cut down the population by hundreds of thousands of emails. Furthermore, I don't know the scope of the investigation, but the FBI could have narrowed down the population even further by focusing only on emails that were sent within a particular date range or among a specific set of people.
- Email Threads: After email filtering, we can reduce the remaining responsive emails by reviewing only the most-inclusive conversation threads. What does that mean? Let's say I send an email to you and our common friend, you reply to that email, and then our common friend replies to your reply. Instead of reviewing all three emails separately, it makes sense to review only our common friend's email, as it already contains my original email and your reply to it. One popular tool for this, which was bought by Microsoft last year, can identify the most-inclusive email threads quite easily (there are other tools as well).
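To make the three steps concrete, here is a minimal Python sketch of such a culling pipeline. It is only an illustration, not how Nuix or any other commercial tool actually works. The Email class, the function names, and the choice of SHA-256 as the hash are all my own assumptions. The threading step also assumes each reply quotes the earlier message verbatim, which real mail clients often do not do.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    """Hypothetical record holding the metadata a processing tool would extract."""
    sender: str
    recipients: tuple  # To/CC/BCC addresses combined
    subject: str
    sent: str          # date sent, kept as text for simplicity
    body: str


def fingerprint(e):
    """Hash the normalized fields; exact duplicates produce the same digest."""
    parts = [
        e.sender.lower(),
        ",".join(sorted(r.lower() for r in e.recipients)),
        e.subject.strip(),
        e.sent,
        e.body.strip(),
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(emails, known_hashes=frozenset()):
    """Keep the first copy of each email; skip anything already in known_hashes
    (for example, hashes from an earlier dump that was already reviewed)."""
    seen = set(known_hashes)
    unique = []
    for e in emails:
        h = fingerprint(e)
        if h not in seen:
            seen.add(h)
            unique.append(e)
    return unique


def with_participant(emails, addresses):
    """Keep only emails where one of the addresses is the sender or a recipient."""
    wanted = {a.lower() for a in addresses}
    return [
        e for e in emails
        if wanted & ({e.sender.lower()} | {r.lower() for r in e.recipients})
    ]


def most_inclusive(emails):
    """Drop an email when its whole body appears inside another, longer email.
    Naive assumption: replies quote earlier messages verbatim."""
    keep = []
    for e in emails:
        text = e.body.strip()
        quoted_elsewhere = any(
            o is not e and text != o.body.strip() and text in o.body
            for o in emails
        )
        if not (text and quoted_elsewhere):
            keep.append(e)
    return keep


def cull(emails, addresses, known_hashes=frozenset()):
    """Run the three steps in order: de-duplicate, filter, thread."""
    unique = deduplicate(emails, known_hashes)
    return most_inclusive(with_participant(unique, addresses))
```

Calling cull(dump, ["subject@example.com"]) on a list of Email objects returns only the emails that survive all three steps. The address is a placeholder. Real tools do the same kind of work at scale, with far more robust normalization and threading logic.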
After all these steps, I wouldn't be surprised if the FBI had only a few hundred to a few thousand emails left to review manually. If they had an idea of what they were looking for, which they could have gotten from their earlier email investigation, they could also have run keyword searches to cull down the review population even more.
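A keyword cull can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the keyword_hits function and the sample keywords are hypothetical, and real review platforms offer much richer search syntax such as proximity, stemming, and fuzzy matching.

```python
import re


def keyword_hits(texts, keywords):
    """Return the positions of texts that contain any keyword as a whole word,
    ignoring case. Assumes keywords start and end with letters or digits."""
    if not keywords:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [i for i, text in enumerate(texts) if pattern.search(text)]
```

Only the emails at the returned positions would then go to a human reviewer.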
This was just one example of a simple workflow that can be used in this case. There are umpteen other tools and utilities that can be used in such an investigation.
To summarize, it is not inconceivable that the FBI was able to finish the review in less than 2 weeks. In my opinion, they actually took more time than necessary. As they say, inefficiency is the hallmark of most government organizations.