A Comprehensive Approach To Unstructured Information Governance: Leveraging Legacy Data Cleanup For Informed Classification

As an information governance technology vendor, we have had the privilege of working with corporations of different sizes on implementing governance across their unstructured data. One common question that arises during these engagements is where to start: should businesses tackle legacy data cleanup first, or should they prioritize setting up rules and controls to stop the bleeding and automate retention for newly created content? In this article, we explore the pros and cons of each approach and provide recommendations based on real-world experience.

First, let’s lay out some general pros and cons for each approach:

Tackling Legacy Content

Pros:

  1. Immediate Impact: Addressing legacy content first allows businesses to quickly identify and remediate redundant, outdated, or trivial (ROT) data, which reduces storage costs and lowers legal and regulatory risk.
  2. Clearing the Slate: By cleaning up legacy content, organizations create a clean slate for implementing consistent information governance practices moving forward. This helps limit further data sprawl.

Cons:

  1. Resource Intensive: Tackling legacy content requires significant resources in terms of time, manpower, and technology. It may involve manual review and classification of large volumes of data, which can be labor-intensive and costly.
  2. Disruption: Purging or archiving legacy content may disrupt existing business processes and workflows, which can lead to resistance from end users and operational challenges.
  3. Complexity: Legacy content is often stored across multiple systems and repositories, making it challenging to identify and classify data consistently. This complexity can hinder the effectiveness of cleanup efforts.

Tackling Retention for Newly Created Content

Pros:

  1. Proactive Approach: Setting up rules and controls to automate retention for newly created content allows organizations to proactively manage data from the point of creation and reduce the accumulation of legacy content.
  2. Efficiency: Automation makes the retention process efficient, reducing manual intervention and ensuring consistent application of retention policies across all data sources.
  3. Compliance: By automating retention, organizations can address legal and regulatory requirements from the outset. This minimizes the risk of non-compliance penalties.

Cons:

  1. Limited Impact on Legacy Content: While automating retention for newly created content is essential for long-term governance, it may not address the immediate challenges posed by existing legacy content, such as ROT and data sprawl.
  2. Implementation Complexity: Implementing automated retention rules requires careful planning and configuration to ensure they align with business needs and regulatory requirements. This complexity can delay implementation and increase project costs.
  3. User Adoption: Employees may resist automated retention rules if they perceive them as restrictive or burdensome, leading to challenges in user adoption and compliance.
  4. Long-Term Investment: The immediate effect of this approach is that data is retained. Compared with deleting data (and the cost savings that come with it), this makes the ROI less visible to company stakeholders and makes it harder to communicate the importance of the project to them. It is essential to ensure that the data created today doesn’t become legacy ROT in the future, but that is the long game, and short-term ROI is often essential for new initiatives to gain widespread support.


Based on our experience working with corporations, it’s evident there is no one-size-fits-all answer. It’s also important to remember that this is only a question of where to start; organizations will not be successful unless their strategy eventually balances both. Having said that, we do have a recommendation.

When implementing effective unstructured data governance, it’s crucial to recognize the role that legacy data cleanup plays in informing how documents are classified on a go-forward basis. By analyzing the intelligence gathered during legacy data cleanup, organizations gain deeper insight into their content landscape. This information enables them to make tailored, contextually relevant updates to their classification system (i.e., the records schedule) before implementing it on a go-forward basis.

Legacy Data Cleanup: A Foundation for Informed Classification

During legacy data cleanup, organizations carry out a comprehensive review of their existing data repositories to identify and remediate redundant, outdated, or trivial (ROT) data. This process involves analyzing the content of documents, understanding their context, and determining their relevance to the organization’s business objectives and regulatory requirements.
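As a rough illustration of what a first-pass ROT review might look like, the Python sketch below flags files as redundant (exact duplicate content), outdated (older than an age threshold), or trivial (very small). The `FileRecord` type, the `flag_rot` function, the thresholds, and the sample records are all hypothetical and chosen only for illustration. Real cleanup tooling would also weigh content, context, and legal holds, which this sketch does not attempt.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class FileRecord:
    """Hypothetical stand-in for a file pulled from a repository."""
    path: str
    content: bytes
    last_modified: date


def flag_rot(records, today, max_age_years=7, trivial_bytes=16):
    """Return {path: set of ROT reasons} for records that look like ROT.

    The thresholds are illustrative assumptions, not recommendations.
    """
    seen_hashes = {}  # content hash -> first path seen with that content
    flags = {}
    for rec in records:
        reasons = set()
        digest = hashlib.sha256(rec.content).hexdigest()
        if digest in seen_hashes:
            reasons.add("redundant")  # exact duplicate of an earlier file
        else:
            seen_hashes[digest] = rec.path
        age_years = (today - rec.last_modified).days / 365.25
        if age_years > max_age_years:
            reasons.add("outdated")
        if len(rec.content) < trivial_bytes:
            reasons.add("trivial")
        if reasons:
            flags[rec.path] = reasons
    return flags


records = [
    FileRecord("finance/q1.xlsx", b"quarterly revenue figures for Q1", date(2023, 4, 1)),
    FileRecord("finance/copy of q1.xlsx", b"quarterly revenue figures for Q1", date(2023, 4, 2)),
    FileRecord("hr/old_policy.doc", b"travel policy superseded long ago", date(2010, 1, 15)),
    FileRecord("tmp/notes.txt", b"todo", date(2024, 1, 5)),
]
rot_flags = flag_rot(records, today=date(2024, 6, 1))
```

In this sample, the copied spreadsheet is flagged as redundant, the 2010 policy as outdated, and the four-byte note as trivial, while the original spreadsheet is left alone. Flags like these would be candidates for review, not automatic deletion.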

As organizations look into their legacy content, they uncover valuable insights into the types of documents, their content, and their relationships within the data landscape. This intelligence serves as a foundation for establishing classification schemes and metadata structures that accurately reflect the organization’s information assets. Too often we have seen organizations create classification categories from what they expect to find in their company’s data, relying on interviews with other people at the company. However, when the data is actually analyzed, there is almost always content that does not fit into that categorization scheme.
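The gap between an expected categorization scheme and the actual data can be surfaced with a simple check, sketched below in Python. Everything here is an assumption made for illustration: the `RECORDS_SCHEDULE` categories and keywords, the sample titles, and the naive substring matching, which stands in for whatever classification method an organization actually uses.

```python
from collections import Counter

# Hypothetical records schedule: category -> keywords expected in titles.
RECORDS_SCHEDULE = {
    "contracts": ["agreement", "contract", "nda"],
    "invoices": ["invoice", "receipt"],
    "hr_records": ["offer letter", "performance review"],
}


def categorize(title, schedule):
    """Return the first category whose keyword appears in the title, else None."""
    lowered = title.lower()
    for category, keywords in schedule.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


def find_schedule_gaps(titles, schedule):
    """Return per-category counts and the titles that match no category."""
    counts = Counter()
    unmatched = []
    for title in titles:
        category = categorize(title, schedule)
        if category is None:
            unmatched.append(title)
        else:
            counts[category] += 1
    return counts, unmatched


titles = [
    "Master Services Agreement 2015",
    "Invoice 10293",
    "Customer complaint email thread",
    "Weekly operations report",
    "NDA - Acme",
]
counts, unmatched = find_schedule_gaps(titles, RECORDS_SCHEDULE)
```

Here the customer email thread and the operations report fall outside the schedule, which is the kind of finding that would prompt a new or revised category. Substring matching is deliberately crude (a keyword like "nda" would also match "calendar"), so a real analysis would need something more robust.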

Informed Classification for Newly Created Documents

Armed with the intelligence gathered from legacy data cleanup, organizations are better equipped to classify newly created documents in a more granular and informed manner. This includes identifying document types, assigning appropriate metadata tags, and applying retention policies based on the document’s content, context, and business value.
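Once a document type has been assigned, applying a retention policy largely reduces to a lookup plus date arithmetic. The following Python sketch shows one way this could work. The `RETENTION_YEARS` table, the default period, and the function names are hypothetical, and the retention periods are placeholders, not legal guidance.

```python
from datetime import date

# Hypothetical retention schedule: document type -> years to retain.
RETENTION_YEARS = {"invoice": 7, "contract": 10, "meeting_notes": 2}
DEFAULT_RETENTION_YEARS = 7  # assumed fallback for unclassified types


def add_years(start, years):
    """Add whole years to a date, mapping Feb 29 to Feb 28 in non-leap years."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def disposition_date(doc_type, created):
    """Return the date on which a document becomes eligible for disposition."""
    years = RETENTION_YEARS.get(doc_type, DEFAULT_RETENTION_YEARS)
    return add_years(created, years)


def is_due_for_disposition(doc_type, created, today):
    """True once the retention period for the document has elapsed."""
    return today >= disposition_date(doc_type, created)


invoice_due = disposition_date("invoice", date(2020, 3, 15))
notes_due = disposition_date("meeting_notes", date(2020, 2, 29))
```

A production system would typically trigger retention from an event (for example, contract expiry) instead of the creation date, and would check for legal holds before any disposition action.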

For example, consider a pharmaceutical company that underwent a legacy data cleanup initiative to make its data repositories more efficient and improve compliance with regulatory requirements. During the process, the organization discovered a significant amount of non-clinical unstructured data related to a company it had acquired a decade earlier. Not only did this force an important conversation about the true value of that content and whether it still needed to be kept, but it also set the stage for a process that could be applied to the data of newly acquired companies in the future.

Case Study: The Large Financial Institution

In our experience working with a large financial institution, the legal and compliance team had invested heavily in revamping their records schedule to align with international regulations. They were eager to start setting up policies and implementing retention actions on a go-forward basis.

However, we advocated for prioritizing legacy data cleanup to establish a return on investment (ROI) early in the process and maintain executive buy-in. Applying seven-year retention periods to newly created content was important, but it was simply not as visible and tangible as deleting old content and the potential liabilities along with it. During the cleanup process, we discovered whole categories of information that were not accounted for in the revised records schedule, including historical financial data, customer communications, and operational reports.

By addressing these gaps in the records schedule and establishing an understanding of the organization’s legacy content, we were able to develop a more robust classification scheme for newly created documents. This not only enhanced compliance efforts but also improved the organization’s ability to manage and leverage its information assets effectively.

Balancing Legacy Data Cleanup and Informed Classification

Legacy data cleanup serves as a critical foundation for informed classification of documents on a go-forward basis. By leveraging the intelligence gathered during the cleanup process, organizations can establish more accurate and granular classification schemes that align with their business objectives, regulatory requirements, and operational needs.

As illustrated by the case study of the large financial institution, prioritizing legacy data cleanup early in the process can uncover valuable insights and gaps in records schedules, which supports a broad and effective approach to unstructured data governance. By striking a balance between legacy data cleanup and informed classification, organizations can unlock the full potential of their information assets while minimizing risks and supporting compliance with regulatory requirements.


About The Author