I recently blogged about the 5 C's of Big Data, and have since received inquiries about the "C" I named "Content," which relates to the size of large files or objects. The concept of large object data has been around in IT for years, and databases have long supported this type of data. There is no formally agreed-upon threshold for what constitutes a large object -- I found one vendor that documented 32K as large object data and another that suggested 8TB -- so I arbitrarily set the cutoff at 1GB for a single object; anything 1GB or larger is, in my opinion, large object data. I grouped large object data into the Big Data bucket because of the explosive growth in large objects in many organizations, which corresponds to the rapid data growth that has been called Big Data. More importantly, organizations increasingly need to perform analysis on this large object data, and that is really where the Big Data problem rears its ugly head.
Large object data can come from several types of systems and domains. Video and audio files are sources where single files can sometimes exceed 1GB, and these sources continue to grow. One hour of high-definition video ranges from roughly 1GB to 2GB for 720p and 1080p, respectively. YouTube reports that 72 hours of video are uploaded every minute. Although that alone is a Big Data problem for YouTube, organizations that rely on monitoring news feeds may want to monitor YouTube as well, since YouTube videos can be a source of news, which makes it a Big Data problem for them too.
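As a back-of-the-envelope check on those figures, file size is just bitrate times duration. The bitrates below are illustrative assumptions of mine (real encodes vary widely with codec and encoder settings), but they land in the 1-2GB-per-hour range mentioned above:

```python
def video_size_gb(bitrate_mbps, hours):
    # Convert a video bitrate (megabits/second) and duration to file size in GB:
    # megabits/s -> bytes/s (divide by 8, multiply by 10**6), then scale by duration.
    seconds = hours * 3600
    return bitrate_mbps * 1e6 / 8 * seconds / 1e9

# Assumed, illustrative bitrates -- not authoritative figures for 720p/1080p.
print(video_size_gb(2.5, 1))  # ~720p-class stream: about 1.1 GB per hour
print(video_size_gb(4.5, 1))  # ~1080p-class stream: about 2 GB per hour
```

At these assumed rates, a single hour of footage already crosses my 1GB large-object cutoff, and a multi-hour recording does so several times over.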
Specific industries have an even larger dependence on video, such as video analysis for sports teams. Audit logs on heavily used systems can also be a source of large object data. Although most organizations prune their audit logs long before they reach this size, I have seen organizations with single audit log files exceeding 5GB. Performing analytics on a single file of this size can be complex, although, unlike some other types of large object data, large audit logs can often be split easily for analysis purposes. In the bioinformatics world, genome sequence files routinely exceed 1GB and can reach tens or even hundreds of gigabytes. Organizations that analyze and compare genome sequences are working with a Big Data problem, and this analysis is one of the driving costs of genome sequencing. Shapefiles used for maps on websites that aggregate data can often be hundreds of MBs to GBs in size, and can present a Big Data issue when comparing shapefiles of the same area for differences. There are several other examples of large object data typical in specific industries and in organizations in general.
Some of the tools used for typical Big Data problems are not necessarily suited to large object data without some up-front process to split the files. Storm, for example, is better at processing smaller, streaming messages. Of course, there are several examples where solutions applicable to other Big Data problems work successfully with large object data. There are libraries for genome sequence processing using HBase. HDFS is designed to support large files, with a theoretical file-size limit of 512 yottabytes using the 64MB default block size. I hope I am retired before 512YB files are the norm in IT.
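To make the HDFS point concrete: HDFS stores a file as a sequence of fixed-size blocks, so even a multi-GB object maps to a modest, easily distributed set of blocks. This sketch assumes the 64MB default block size mentioned above (configurable in practice):

```python
BLOCK_SIZE = 64 * 1024**2  # HDFS default block size discussed above: 64 MB

def hdfs_block_count(file_size_bytes):
    # Number of blocks HDFS splits a file into (ceiling division: the
    # final block may be partially filled).
    return -(-file_size_bytes // BLOCK_SIZE)

print(hdfs_block_count(1 * 1024**3))    # 1 GB file  -> 16 blocks
print(hdfs_block_count(100 * 1024**3))  # 100 GB file -> 1600 blocks
```

Because each block can live on a different node, a 100GB genome file becomes 1,600 units of parallel work rather than one unwieldy object, which is exactly the property that makes HDFS a better fit for large object data than message-oriented tools.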
Large object data will continue to grow across organizations. Fortunately, cloud-based solutions exist that can help address these issues.