My blog last month on data efficacy was well received. Data efficacy is essentially the ability to manage data in such a way as to drive informed decisions. I discussed three factors that have contributed to the lack of data efficacy over the years, and promised a blog this month covering some analytics that would help organizations measure their level of data efficacy. This blog shares the top five analytics that my organization typically uses for measuring data efficacy.
Because measuring data efficacy requires analyzing "big data" sources, one technology that shows promise for measuring data efficacy, and thereby helping organizations increase it, is Apache Hadoop. Hadoop is being used across several industries to provide analytics for big data sources. My organization uses Hadoop-based analytics to assess the level of data efficacy across an organization's enterprise data sources. Here are some examples.
Historical Value of Data
Whether looking at the date that data in a database was created or the date that a Word document was last modified on a user's computer, the age of information is fairly easy to compute in most cases. Often, though, data, unlike wine, does not become better (i.e., more valuable, in this case) with time. Data usually loses value with time, except where pedigree matters most, as with data derived from another source. Analytics that provide the age of information across enterprise data sources can feed a dashboard containing a timeline that depicts the historical value of data. I typically suggest blending the number of artifacts (rows in a table, Word documents, etc.) created in a time period with the amount of storage those artifacts require. This helps the user understand whether individual data objects are growing in size over time. The historical value of data should also be computed by various factors, such as data source, data source type (e.g., RDBMS, NoSQL database, Microsoft Office documents [and maybe even the individual types therein], PDF, etc.), organization group (e.g., marketing, sales, etc.), and time period (e.g., week of the month, day of the week, month of the year, etc.), among others. The dashboard drill-downs that these factors enable help users better understand the historical value of data across the enterprise.
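As a rough illustration of that blend, here is a minimal Python sketch that counts artifacts and sums their storage per data source and month. The record layout, source names, and sizes are made up for the example; in practice the same grouping would run as a Hadoop job over real metadata.

```python
from collections import defaultdict
from datetime import date

# Hypothetical artifact records: (data source, creation date, size in bytes).
ARTIFACTS = [
    ("crm_db", date(2013, 1, 15), 2_000),
    ("crm_db", date(2013, 1, 20), 4_000),
    ("file_share", date(2013, 1, 5), 50_000),
    ("file_share", date(2013, 2, 9), 150_000),
    ("file_share", date(2013, 2, 11), 250_000),
]


def historical_profile(artifacts):
    """Group artifacts by (source, 'YYYY-MM') and report count, bytes, and average size."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for source, created, size in artifacts:
        bucket = totals[(source, created.strftime("%Y-%m"))]
        bucket["count"] += 1
        bucket["bytes"] += size
    # Average size per artifact shows whether individual objects grow over time.
    return {
        key: {**stats, "avg_bytes": stats["bytes"] / stats["count"]}
        for key, stats in totals.items()
    }


profile = historical_profile(ARTIFACTS)
for (source, month), stats in sorted(profile.items()):
    print(source, month, stats["count"], stats["bytes"], stats["avg_bytes"])
```

Swapping the month key for week of the month, day of the week, or organization group gives the other drill-downs.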
Data originality may appear to be an odd analytic to calculate, but there are examples of organizations with petabytes of storage designed to get data into the hands of decision makers that learn that a very large majority of the data is heavily duplicated. It is readily apparent that duplicate data increases the cost of managing data, since each instance of the same data requires separate storage, but duplicate data also leads to data inefficacy by obscuring real information in search results. For example, imagine a user receives 100 results from a search across an enterprise document management system, and 30 percent of the results are near-exact matches to the search terms but reference data derived from a small group of three or four documents. The user could read through a page of search results, see the same information repeated, and decide that there is nothing more to glean from the search, even though several original documents contain new information. In the structured data world, third normal form practices typically reduce data duplication, but sometimes, for performance reasons, data may still be duplicated. For unstructured data sources, the emergence of enterprise content/document management systems has reduced data duplication somewhat, but derived content is still a leading contributor to duplicate data.
Data originality analytics should compare content on two metrics: exact or similar content, and the same or similar facts. The first requires a simple match on words, with substitutions for synonyms and removal of superfluous wording. This is the easier of the two. The second requires comparing facts derived from the content; it is typically used for unstructured data, but is relevant for structured sources as well. Data originality should also be computed by data source, type of data source, and other factors where dashboard drill-downs provide relevant information. The goal in providing this data originality information isn't necessarily to remove duplicate data, but to make informed decisions about the cost of storage as well as the possible data overload in search results caused by the duplicate data.
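The first metric can be sketched in a few lines of Python. This is only an illustration: the synonym map and filler-word list are tiny made-up examples, and the Jaccard score over word sets is my choice of similarity measure here, not the only way to compare content.

```python
import re

# Hypothetical synonym map and filler-word list; real ones would be domain specific.
SYNONYMS = {"purchase": "buy", "acquire": "buy", "firm": "company"}
FILLER = {"the", "a", "an", "of", "to", "will", "is"}


def normalize(text):
    """Lowercase, drop filler words, and map synonyms to one canonical word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {SYNONYMS.get(word, word) for word in words if word not in FILLER}


def similarity(a, b):
    """Jaccard similarity of the normalized word sets, from 0.0 to 1.0."""
    words_a, words_b = normalize(a), normalize(b)
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


original = "The firm will purchase the warehouse"
derived = "Company to acquire a warehouse"
unrelated = "Quarterly marketing budget review"

print(similarity(original, derived))
print(similarity(original, unrelated))
```

Pairs that score above a chosen threshold can be flagged as likely derived content. The second metric, comparing extracted facts, needs a fact-extraction step that this sketch does not attempt.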
Two types of analytics make up the data storage set: data storage by data source, and storage of the original data versus derived data. Together they show the cost of storage allocated to data sources across an organization. Providing the amount of storage required for a particular data source highlights one cost of that data source. As analytics become more prevalent in IT organizations, determining the storage required for the actual data versus derived data can help assess how efficient those analytics are.
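Both storage measures reduce to simple sums. The Python sketch below assumes a hypothetical inventory in which each entry is already tagged as original or derived; the source names and sizes are invented for the example.

```python
from collections import defaultdict

# Hypothetical inventory: (data source, is_derived, size in bytes).
INVENTORY = [
    ("sales_db", False, 600),
    ("sales_db", True, 200),
    ("doc_mgmt", False, 300),
    ("analytics_store", True, 900),
]


def storage_by_source(inventory):
    """Total bytes held by each data source."""
    totals = defaultdict(int)
    for source, _is_derived, size in inventory:
        totals[source] += size
    return dict(totals)


def derived_share(inventory):
    """Fraction of all bytes that hold derived rather than original data."""
    total = sum(size for _, _, size in inventory)
    derived = sum(size for _, is_derived, size in inventory if is_derived)
    return derived / total if total else 0.0


print(storage_by_source(INVENTORY))
print(derived_share(INVENTORY))
```

A high derived share is not bad in itself, but it is a prompt to ask whether the analytics producing that data justify the storage they consume.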
Content analysis provides details about the most common facts by data source, including the number of times a fact is asserted within each data source and across the enterprise. This is useful for seeing the common facts or types of facts found in unstructured and/or semi-structured data sources, where a schema and/or data model may not typically exist. It is also useful for finding misuse of the "standard lexicon" within unstructured data sources, such as the phrase "act on this great deal," which would typically require in-depth analysis by a senior appraiser, appearing in documents authored by junior analysts. This analytic is typically performed across all data sources in an enterprise as well as for individual data sources, and is typically viewed in a dashboard as a scrollable table of the top X results. This analytic can also show the top content being added to data sources across an enterprise by pivoting on just recent data (e.g., last day, hour, etc.).
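Once facts have been extracted, the top X table is a counting exercise. The Python sketch below assumes the extraction has already happened and works from a hypothetical list of (data source, fact) pairs; the sources and facts are invented.

```python
from collections import Counter, defaultdict

# Hypothetical (data source, fact) assertions; fact extraction is out of scope here.
ASSERTIONS = [
    ("email", "act on this great deal"),
    ("email", "act on this great deal"),
    ("reports", "act on this great deal"),
    ("reports", "q3 revenue rose"),
    ("email", "q3 revenue rose"),
    ("reports", "office move delayed"),
]


def top_facts(assertions, x):
    """Return the top x facts across the enterprise and per data source."""
    enterprise = Counter(fact for _, fact in assertions)
    by_source = defaultdict(Counter)
    for source, fact in assertions:
        by_source[source][fact] += 1
    per_source = {source: counts.most_common(x) for source, counts in by_source.items()}
    return enterprise.most_common(x), per_source


enterprise_top, per_source_top = top_facts(ASSERTIONS, 2)
print(enterprise_top)
print(per_source_top["email"])
```

Filtering the assertions to a recent time window before counting gives the "recently added content" view.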
The fifth analytic is similar to content analysis in that it focuses on facts, and can therefore be visualized similarly, but it provides details on the value of data in terms of search results, views, and updates. It uses audit logs across the enterprise as the primary source of data to determine which users are searching for, accessing, and updating data. This analytic typically surfaces the most useful data, in terms of what users themselves find valuable.
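As a minimal sketch, assuming audit-log events have already been parsed into (user, action, item) tuples, the Python below tallies searches, views, and updates per item and ranks items by total activity. The users, actions, and items are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical parsed audit-log events: (user, action, item).
AUDIT_LOG = [
    ("ann", "search", "pricing policy"),
    ("bob", "view", "pricing policy"),
    ("ann", "view", "pricing policy"),
    ("cal", "update", "pricing policy"),
    ("bob", "search", "travel form"),
    ("cal", "view", "travel form"),
]


def usage_counts(log):
    """Count each action type (search, view, update) per item."""
    counts = defaultdict(Counter)
    for _user, action, item in log:
        counts[item][action] += 1
    return counts


def rank_by_activity(log):
    """Order items from most to least total activity."""
    counts = usage_counts(log)
    return sorted(counts, key=lambda item: sum(counts[item].values()), reverse=True)


counts = usage_counts(AUDIT_LOG)
ranking = rank_by_activity(AUDIT_LOG)
print(ranking)
print(dict(counts["pricing policy"]))
```

Treating every action as equally valuable is a simplification; weighting updates above views, for example, is a reasonable variation.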
Although there are many other analytics that help measure data efficacy, this list of my top five will get you well on your way to tackling any data efficacy issues.