Mitigating Growing Data Volumes: The Past, Present, and Future of Discovery

Data is being created and stored at a rapid rate. The International Data Corporation (“IDC”) estimates that worldwide data will reach 181 zettabytes by 2025, representing a compound annual growth rate (“CAGR”) of 23% over the 2021-2025 period. The IDC also estimates a 19.2% CAGR in worldwide storage capacity over the same period, growing to more than 16 ZB. For organizations, such dramatic growth in stored data means more data is subject to discovery than ever before, and therefore higher costs to collect, process, host, review, and produce that data. This post explores the evolution of approaches to mitigating the costs of discovery on ever-increasing data volumes and considers which strategies might be successfully deployed next to prepare for the future of discovery.

A Note on Proportionality 

The legal concept of proportionality is a cost-benefit analysis, dictating that the burden of discovery should be weighed against the importance of the information to the case at hand. Proportionality guidelines have served as one way the industry has attempted to manage the size and scope of discovery demands against increasing data volumes.  

The United Kingdom has had greater success defining and implementing the practice of proportionality to limit discovery burdens. Under the UK’s Civil Procedure Rules (CPR 1.1), courts address proportionality based on:

  • the amount of money involved
  • the importance of the case
  • the complexity of the issues
  • the financial position of each party 

Even so, disclosure data volumes are often still burdensome enough to require advanced technology and large-scale review prior to production. Ironically, the existence of such technology has served to increase the expectations of courts, and thus the extent of discovery obligations.

The Federal Rules of Civil Procedure attempt to center proportionality in their electronic discovery amendments, but with divergent views set forth in varying state laws and court rulings, the impact and application of standardized proportionality guidelines have been decidedly less consistent than in the United Kingdom. Complicating matters further, less restrictive privacy regulations in the US have enabled businesses to retain much of their data indefinitely. Therefore, not only are increasing data volumes burdening discovery, but regular data clean-up activities, which would naturally limit data growth, are not taking place.

Traditional Approaches to Mitigating Discovery Costs 

Traditional approaches to mitigating the effect of growing data volumes on discovery fall into three categories: data culling, review acceleration, and brute force.  

Data Culling

Data culling techniques represent some of the earliest approaches to reducing data volume and are still regularly employed. The goal of data culling is to reduce the hosted data set by using document metadata and text to eliminate redundancy and less relevant data. Data culling techniques are typically applied from the time documents are ingested by a processing tool until the documents are promoted for review. 

De-duplication and DeNISTing have been the most common data culling approaches to eliminating redundancies and junk files. These techniques are generally applied during data processing, and both filter documents by unique identifiers or hash values.

  • De-duplication refers to the process of comparing electronic documents based on their content and characteristics, and removing duplicative records from the data set so that only one instance of an electronic record is reviewed.  
  • DeNISTing applies the National Institute of Standards and Technology’s master list of hash values for known, traceable computer applications to eliminate filetypes known to be of no value to discovery.
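
As a minimal sketch of how both techniques filter by hash value, the Python below hashes each file, drops anything whose hash appears on a known-file list, and keeps only the first instance of each remaining hash. The function names, the sample files, and the tiny stand-in for the NIST list are all hypothetical, and SHA-1 is simply one common choice of hash. Real processing tools typically hash emails on selected fields and apply de-duplication per custodian or globally, which this sketch does not attempt.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Return a SHA-1 hex digest used as the file's unique identifier."""
    return hashlib.sha1(content).hexdigest()


def cull(documents: dict, known_hashes: set) -> dict:
    """DeNIST, then de-duplicate, keeping the first instance of each hash."""
    kept = {}
    seen = set()
    for name, content in documents.items():
        digest = file_hash(content)
        if digest in known_hashes:  # DeNIST: hash is on the known-file list
            continue
        if digest in seen:          # de-duplication: exact copy already kept
            continue
        seen.add(digest)
        kept[name] = content
    return kept


# Hypothetical data set: one duplicate memo and one known system file.
docs = {
    "memo.txt": b"Q3 pricing memo",
    "memo_copy.txt": b"Q3 pricing memo",
    "notepad.exe": b"MZ placeholder system binary",
    "contract.txt": b"Master services agreement",
}
# Hypothetical stand-in for the NIST hash list.
nist_list = {file_hash(b"MZ placeholder system binary")}

survivors = cull(docs, nist_list)
print(sorted(survivors))  # ['contract.txt', 'memo.txt']
```

Because the comparison is on exact hash values, a single changed character produces a different hash, which is why near duplicate grouping is treated as a separate technique later in this post.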

Data filters use metadata and document text to target documents for review. Filters are most frequently applied prior to document review during data processing or early case assessment phases. Data filters include:  

  • Filetype filter: File inclusion and exclusion lists operate similarly to DeNISTing, ensuring that specified filetypes are included or excluded based on their perceived value to the matter. Discovery vendors often have standard filetype lists applied by their processing team.  
  • Date range filter: Date range filters eliminate documents from a data set that are not associated with a specified date range. Date range filters can be applied broadly across the entire dataset or narrowly (e.g., to a specific document custodian).  
  • Other metadata filters: Filetype and date range filters are examples of common metadata filters. Other common metadata filters include file size filters (flagging documents that are unreasonably large, often in anticipation of code), domain filters (limiting emails to those with specific email domains), and location filters (limiting documents to those with specific file paths).  
  • Search terms: Search terms limit results to documents containing specific terms or expressions. Terms can be crafted with exact phrases or can employ search syntax that allows for word stemming, spelling variation, wildcards, or position relative to the document or other words within the document. Terms can be joined by Boolean logic.
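
The metadata filters above can be sketched as a simple conjunction of tests. This is an illustration only: the `Doc` fields, the inclusion list, the date range, and the domains are hypothetical values, not a vendor's standard settings.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    name: str
    extension: str
    sent: date
    sender_domain: str


def passes_filters(doc, allowed_extensions, start, end, allowed_domains):
    """Return True only if the document survives every metadata filter."""
    return (
        doc.extension in allowed_extensions           # filetype inclusion list
        and start <= doc.sent <= end                  # date range filter
        and doc.sender_domain in allowed_domains      # domain filter
    )


# Hypothetical documents and filter settings.
docs = [
    Doc("offer.msg", ".msg", date(2021, 3, 14), "acme.example"),
    Doc("newsletter.msg", ".msg", date(2021, 4, 2), "promo.example"),
    Doc("old_offer.msg", ".msg", date(2018, 7, 9), "acme.example"),
    Doc("installer.dll", ".dll", date(2021, 5, 1), "acme.example"),
]
promoted = [
    d.name
    for d in docs
    if passes_filters(
        d,
        allowed_extensions={".msg", ".docx", ".pdf"},
        start=date(2021, 1, 1),
        end=date(2021, 12, 31),
        allowed_domains={"acme.example"},
    )
]
print(promoted)  # ['offer.msg']
```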

Though there is no shortage of flexibility in search syntax and Boolean logic, search terms are often overbroad or over-narrow, and they miss documents because they cannot account for human factors and the intricacies of language.  
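
To make the overbroad and over-narrow problem concrete, here is a small sketch of a wildcard and proximity search written with Python's standard `re` module. The helper names and the three sample sentences are hypothetical, and the syntax is not that of any particular review platform. The query is roughly "fix* within 3 words of price*".

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def positions(words, term):
    """Word positions matching a term; a trailing * is a prefix wildcard."""
    term = term.lower()
    if term.endswith("*"):
        return [i for i, w in enumerate(words) if w.startswith(term[:-1])]
    return [i for i, w in enumerate(words) if w == term]


def has_term(text, term):
    return bool(positions(tokenize(text), term))


def within(text, first, second, distance):
    """Proximity search: `first` occurs within `distance` words of `second`."""
    words = tokenize(text)
    return any(
        abs(i - j) <= distance
        for i in positions(words, first)
        for j in positions(words, second)
    )


# Hypothetical documents: only "cartel" and "coded" are actually of interest.
docs = {
    "cartel": "We agreed to fix prices with the other bidders.",
    "facilities": "Please fix the price tag on the light fixture.",
    "coded": "Let's all quote the same rate so nobody undercuts.",
}

proximity_hits = {n: within(t, "fix*", "price*", 3) for n, t in docs.items()}
# Boolean refinement: the same proximity query AND NOT "fixture".
refined_hits = {
    n: within(t, "fix*", "price*", 3) and not has_term(t, "fixture")
    for n, t in docs.items()
}
print(proximity_hits)  # {'cartel': True, 'facilities': True, 'coded': False}
print(refined_hits)    # {'cartel': True, 'facilities': False, 'coded': False}
```

The first query is overbroad (it hits the facilities email) and over-narrow (it misses the coded message). The Boolean refinement removes the false hit, but no amount of syntax recovers the document that never uses the expected words.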

Review Accelerators

Review accelerators represent the second level of traditional approaches to mitigating the impact of growing data volumes on discovery. In contrast to data culls, the goal of review accelerators is to reduce the number of billable hours it takes to complete a document review. Examples of common review acceleration approaches include email threading, near duplicate grouping, and unsupervised or supervised machine learning.    

Email threading and near duplicate grouping are structural analysis tools that organize documents so reviewers can move through similar documents more quickly. These tools analyze the textual composition of each document and use it to group related or textually similar documents in a way that makes sense for review. 

  • Email threading groups all replies and forwards stemming from every root email into email thread groups. Sorting documents for review by email thread group helps accelerate document review by increasing the speed by which the reviewer can complete their review. It also serves to increase the consistency with which documents are reviewed, and saves time on quality control and second level review. Document review speeds can accelerate by 15-20 percent when reviewing an email thread in order versus having that thread dispersed piecemeal across multiple review batches. This increase can translate to an additional ten or more documents reviewed per hour per reviewer.  
  • Near duplicate grouping compares documents’ text and groups documents that have above a certain percentage of similar text into a near duplicate group. Sorting near duplicate groups together for review similarly increases review speed.  
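
Both grouping techniques can be sketched in a few lines of Python. Two assumptions are worth flagging: threading here follows a hypothetical "parent" link between messages (for example, a reply-to reference) rather than analyzing message text, and near duplicate similarity is measured as Jaccard overlap of three-word shingles with an arbitrary 50 percent threshold. Commercial tools use their own, more sophisticated methods.

```python
def thread_root(message_id, parent_of):
    """Follow reply/forward links up to the root email (assumes no cycles)."""
    while parent_of.get(message_id) is not None:
        message_id = parent_of[message_id]
    return message_id


def thread_groups(parent_of):
    """Group every message under the root email of its thread."""
    groups = {}
    for message_id in parent_of:
        groups.setdefault(thread_root(message_id, parent_of), []).append(message_id)
    return groups


def shingles(text, size=3):
    """Set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(a, b):
    """Share of shingles the two documents have in common."""
    return len(a & b) / len(a | b) if a | b else 1.0


def near_duplicate_groups(docs, threshold=0.5):
    """Greedy grouping: compare each document to the first member of each group."""
    groups = []
    for name, text in docs.items():
        for group in groups:
            if jaccard(shingles(text), shingles(docs[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical emails: m2 replies to m1, m3 forwards m2, m4 is unrelated.
parents = {"m1": None, "m2": "m1", "m3": "m2", "m4": None}
threads = thread_groups(parents)
print(threads)  # {'m1': ['m1', 'm2', 'm3'], 'm4': ['m4']}

# Hypothetical documents: two contract drafts differing by one word.
texts = {
    "v1": "the parties agree to the terms set out in schedule a",
    "v2": "the parties agree to the terms set out in schedule b",
    "other": "lunch is at noon on friday in the main cafeteria",
}
similarity = jaccard(shingles(texts["v1"]), shingles(texts["v2"]))
groups = near_duplicate_groups(texts)
print(similarity)  # 0.8
print(groups)      # [['v1', 'v2'], ['other']]
```

Batching by `threads` or `groups` is what lets a reviewer see related documents back to back instead of scattered across batches.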

In contrast to structural analysis, unsupervised and supervised machine learning tools help organize and prioritize documents for review based on the conceptual content of the documents rather than text structure or metadata.  

  • Unsupervised machine learning tools like clustering group documents determined by the technology to be conceptually similar (without any user input or machine training). Conceptual groups or clusters are often named with a string of words representing some of the more common concepts that bind documents within the cluster together. Tools like clustering can be used to gain insight into your review population and inform a more strategic review. Clustering can also help prioritize concepts of interest earlier in review and identify potentially irrelevant documents such as mass marketing emails that can be set aside.  
  • Supervised machine learning differs from unsupervised machine learning in that a user can define the buckets documents will be grouped into by identifying examples for the machine learning model to learn from. Supervised machine learning is typically applied towards the beginning of the review phase after traditional data culls have already been applied. TAR 2.0 and Continuous Active Learning techniques are the most prevalent examples of supervised learning in discovery. TAR/CAL accelerates review by identifying important (or relevant) documents for priority review. As review continues, the remaining set of documents is increasingly less relevant, finally reaching a point where the documents being reviewed are overwhelmingly irrelevant, and the remaining population can be sampled and set aside. Depending on the richness of a review population, TAR/CAL can cut the data size in half. However, certain documents, such as those that are heavily numeric (e.g., spreadsheets) or those lacking text, often require independent review, as supervised learning is not built to handle this type of data.  
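
The review loop behind TAR/CAL can be illustrated with a toy example. Everything here is an assumption made for the sketch: the scoring model is a simple term tally rather than a real classifier, the `reviewer` callback stands in for an attorney's coding decision, and the stopping rule (stop when a batch contains no relevant documents) is a simplification of the sampling and validation a defensible workflow would use.

```python
from collections import Counter


def train(docs, labels):
    """Toy model: +1 per term in a relevant doc, -1 per term in an irrelevant one."""
    weights = Counter()
    for idx, is_relevant in labels.items():
        for term in set(docs[idx].split()):
            weights[term] += 1 if is_relevant else -1
    return weights


def score(text, weights):
    return sum(weights[term] for term in set(text.split()))


def continuous_active_learning(docs, reviewer, seed, batch_size=2):
    """Review the highest-scoring documents first, retraining after each batch."""
    labels = {i: reviewer(i) for i in seed}
    while True:
        weights = train(docs, labels)
        unreviewed = [i for i in range(len(docs)) if i not in labels]
        if not unreviewed:
            break
        # Stable sort: ties keep their original document order.
        ranked = sorted(unreviewed, key=lambda i: -score(docs[i], weights))
        found = 0
        for i in ranked[:batch_size]:
            labels[i] = reviewer(i)
            found += int(labels[i])
        if found == 0:  # hypothetical stopping rule: batch was all irrelevant
            break
    set_aside = [i for i in range(len(docs)) if i not in labels]
    return labels, set_aside


# Hypothetical review population; documents 0, 1, 2, and 6 are relevant.
docs = [
    "rebate agreement with distributor pricing",
    "distributor rebate schedule pricing tiers",
    "pricing rebate dispute with distributor",
    "office holiday party schedule",
    "parking garage closed friday",
    "holiday party catering menu",
    "quarterly rebate payment to distributor",
    "friday parking reminder",
    "catering invoice for holiday party",
    "garage parking permit renewal",
]
relevant = {0, 1, 2, 6}

reviewed, set_aside = continuous_active_learning(
    docs, reviewer=lambda i: i in relevant, seed=[0]
)
print(list(reviewed))  # [0, 2, 1, 6, 3, 4, 7]
print(set_aside)       # [5, 8, 9]
```

In this toy run the relevant documents surface in the first two batches, and the loop stops once a batch comes back entirely irrelevant, leaving three documents that would be sampled rather than fully reviewed.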

Even with data culls and review accelerators, legal teams often take weeks of review to identify key documents that support their arguments. As a result, recent discovery trends focus on Early Case Assessment (“ECA”) tools like data visualizations and conceptual analytics to help attorneys better identify key documents and understand their data earlier in the electronic discovery lifecycle. ECA results not only in a more effective approach to data culls, but also in a more thoughtful and conscientious approach to document review organization and structure.  

Brute Force

Layering data culls with review accelerators has served as the chief method to mitigate discovery costs, with ECA improving our strategy and understanding earlier in a matter. However, even with layering data culling and review techniques, data populations can still be large enough to require brute force – that is, deploying review teams of 50, 100, or 200 contract reviewers on a matter to plow through data in the fastest way possible. The problem with brute force, of course, is that while it can reduce the time it takes to tackle large data volumes, unlike data culling or review accelerators, it will likely increase the overall cost of review.  

The Future of Discovery

While ECA technology represented a step forward, empowering legal teams to reduce discovery costs and identify key documents faster, ECA still occurs after spending time and money to collect and ingest significantly more documents than are required for a given matter.  

ECA, but in Real Time 

The future of discovery will build on the benefits of ECA and traditional approaches to mitigating the effects of growing data volumes, but seek to conduct this analysis on data in place, in real time, before collecting a single document. Tools with these capabilities exist right now.  

Information governance technologies like Rational Governance empower organizations to conduct many aspects of e-discovery in house – from document identification and analysis to collection. Organizations wielding the power of governance platforms can find potentially relevant documents immediately using advanced search and machine learning, test and refine searches on the fly, and then collect and export only truly relevant data directly to review platforms (like Rational Review). This process eliminates collection and processing fees, and more importantly, minimizes the amount of inconsequential data sent for attorney review, further reducing discovery costs. 


Just as the amount of data we produce is growing, the amount of data stored by organizations is growing exponentially, nearly doubling in size between 2020 and 2022, according to a report from Seagate and IDC. While enterprise data can be used to drive better business decisions, it also comes with increased risk factors, including being subject to discovery. For this reason, an organization’s records management policies and technologies to enforce them are crucial to mitigating discovery costs. Organizations must clean up data that is no longer of value and not subject to records retention guidelines, especially ROT data. Like the over-collected, irrelevant data you are trying to remove from your discovery data set, ROT data offers little value to an organization. 

ROT stands for Redundant, Obsolete, and Trivial: three types of data an organization does not need to retain. 


  • Redundant data refers to multiple copies of the same data across an organization and is one of the primary causes of exponential data growth. 
  • Obsolete data is data that no longer serves a purpose, either because it is no longer accurate; is representative of a legacy organization, project, or product; or has long outlived data retention requirements. 
  • Trivial data is purposeless data, untouched and merely taking up space on servers (for instance, the byproduct of daily tasks never to be revisited).
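
The three ROT categories above can be approximated with simple rules, sketched below in Python. The record fields, the two-year "untouched" threshold, and the sample files are hypothetical, and the rules are deliberately crude: redundant means an exact content hash already seen, obsolete means a retention date that has passed, and trivial means no retention requirement and no access for a long time. An actual policy would come from the organization's records retention schedule and any legal holds.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FileRecord:
    path: str
    content: bytes
    last_accessed: date
    retain_until: Optional[date]  # None means no retention requirement applies


def classify_rot(files, today, untouched_days=730):
    """Label each file as redundant, obsolete, trivial, or keep."""
    seen = set()
    labels = {}
    for f in files:
        digest = hashlib.sha256(f.content).hexdigest()
        if digest in seen:
            labels[f.path] = "redundant"   # exact copy of an earlier file
        elif f.retain_until is not None and f.retain_until < today:
            labels[f.path] = "obsolete"    # retention period has expired
        elif f.retain_until is None and (today - f.last_accessed).days > untouched_days:
            labels[f.path] = "trivial"     # unretained and long untouched
        else:
            labels[f.path] = "keep"
        seen.add(digest)
    return labels


# Hypothetical file inventory.
files = [
    FileRecord("/hr/policy.docx", b"policy v3", date(2023, 12, 1), date(2030, 1, 1)),
    FileRecord("/users/jo/policy_copy.docx", b"policy v3", date(2022, 5, 5), date(2030, 1, 1)),
    FileRecord("/legacy/product_x_spec.doc", b"spec", date(2016, 2, 1), date(2015, 6, 30)),
    FileRecord("/tmp/export.csv", b"scratch", date(2019, 1, 2), None),
    FileRecord("/sales/forecast.xlsx", b"forecast", date(2023, 11, 15), None),
]
rot_labels = classify_rot(files, today=date(2024, 1, 1))
for path, label in rot_labels.items():
    print(f"{label:10} {path}")
```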

The good news is that information governance technologies like Rational Governance can also help organizations clean up the duplicative and valueless data that clutters their storage by employing techniques similar to the review accelerators and data culls discussed earlier in this article, but applying them to data in place, in real time. For tips to identify and clean up ROT data, check out our blog post, Using Technology to Identify and Delete ROT Data.

It’s no accident that the Electronic Discovery Reference Model (“EDRM”) begins with Information Governance. Inadequate records management policies set organizations up to overspend on discovery, beginning with overcollection. Overcollection increases the cost of nearly every phase of discovery that follows.  

Discovery’s answer to data growth cannot be to continue overcollecting, especially when technology exists that removes the need to collect and process data before conducting detailed analysis. The future of discovery inevitably resides at the confluence of data governance and real-time search and analysis technologies, not post-collection solutions. 

Sarah Cole

About The Author

Sarah Cole serves as Senior Director of Consulting at Rational Enterprise. As a technologist and eDiscovery veteran, Sarah is best known for her ardent advocacy of predictive and textual analytics in the legal technology space.