Getting Started with Data Sensitivity Classification

Implementing an information governance tool to help manage sensitive data can feel overwhelming. With regulations varying across jurisdictions and a bevy of new laws pending, it’s easy to find yourself in a state of decision paralysis, wondering: “Where do I possibly begin?” Good news: this post was written for you! We propose that, to begin managing sensitive data, you first classify data sensitivity into broad, widely applicable categories based on risk level.

About Sensitive Data Classification 

The Center for Internet Security (CIS) outlines three categories for data sensitivity: “Public,” “Business Confidential,” and “Sensitive.” We recommend slightly modifying these categories by splitting the “Sensitive” data classification into two subcategories for added nuance: “Sensitive – Legal and Intellectual Property” and “Sensitive – Personal Information.”

  • Public Classification: Data classified as Public will be the least sensitive of all data in your organization. This dataset can include public filings, publications, press releases, job posts, and website content. This type of data will not require restrictive access and will pose the least risk if breached.
  • Business Confidential Classification: Data classified as Business Confidential will contain documents and communications for “internal eyes only.” Examples include internal databases, as well as communications and documents that come from individual employees or are shared between internal teams or with third parties, clients, or prospective clients, provided they do not contain Personally Identifiable Information (PII), Protected Health Information (PHI), or otherwise sensitive intellectual property or legal content. Business Confidential documents are not public and should remain confidential. However, in a data breach, they would not cause the greatest degree of harm to the organization.
  • Sensitive: Data classified as Sensitive is organizational data that poses the greatest degree of harm if breached and should be treated with a special degree of care and appropriate access restriction. This data is non-public data that contains personal, highly sensitive, or privileged business information or communications. 
  • Sensitive – Legal/IP Classification: Data classified as Sensitive – Legal/IP contains documents, databases, and communications that include sensitive trade secret information, legal communications or work product, or other known organizational liabilities.
  • Sensitive – Personal Information Classification: Data classified as Sensitive – Personal Information contains PII and PHI, which are often subject to regulations such as GDPR, HIPAA, and various state regulations.
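
One way to make these categories concrete is to encode them as a small data structure. The sketch below is only an illustration: the `Sensitivity` name, the numeric risk ranks, and the decision to rank the two Sensitive subcategories equally are our assumptions, not part of the CIS guidance.

```python
from enum import Enum


class Sensitivity(Enum):
    """Hypothetical encoding of the classifications above as (label, risk rank)."""

    PUBLIC = ("Public", 0)
    BUSINESS_CONFIDENTIAL = ("Business Confidential", 1)
    # Assumption: both Sensitive subcategories share the highest risk rank.
    SENSITIVE_LEGAL_IP = ("Sensitive – Legal/IP", 2)
    SENSITIVE_PERSONAL_INFORMATION = ("Sensitive – Personal Information", 2)

    def __init__(self, label, risk):
        self.label = label
        self.risk = risk


def most_restrictive(candidates):
    """Return the highest-risk classification among the candidates."""
    return max(candidates, key=lambda s: s.risk)


chosen = most_restrictive([Sensitivity.PUBLIC, Sensitivity.BUSINESS_CONFIDENTIAL])
print(chosen.label)
```

Keeping an explicit risk rank makes it easy to resolve conflicts when more than one classification could apply to the same document.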

Getting Started with Classifications 

A successful data sensitivity classification requires balancing the tools in your information governance toolbox with the level of confidence required for each classification. For example, falsely classifying a non-public document containing personal information as Public exposes your organization to greater risk than falsely classifying a Public document as Business Confidential. To avoid this type of error, we encourage tailoring your search criteria until validation shows that the results for each classification fall within a margin of error agreed upon by stakeholders.
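
To illustrate what validating against a margin of error might look like, here is a minimal sketch. It assumes you have a human-reviewed sample of (predicted, actual) label pairs; the `RISK` ranks, the function names, and the idea of tracking under-classification separately from over-classification are our assumptions for illustration.

```python
# Hypothetical risk ranks; a higher number means a more sensitive classification.
RISK = {"Public": 0, "Business Confidential": 1, "Sensitive": 2}


def error_rates(reviewed):
    """Return (under_rate, over_rate) for a list of (predicted, actual) label pairs.

    Under-classification (predicted less sensitive than actual) is the costlier
    error, so it is reported separately from over-classification.
    """
    if not reviewed:
        raise ValueError("reviewed sample is empty")
    under = sum(1 for predicted, actual in reviewed if RISK[predicted] < RISK[actual])
    over = sum(1 for predicted, actual in reviewed if RISK[predicted] > RISK[actual])
    return under / len(reviewed), over / len(reviewed)


def within_tolerance(reviewed, max_under, max_over):
    """Check both error rates against stakeholder-agreed limits."""
    under, over = error_rates(reviewed)
    return under <= max_under and over <= max_over


sample = [
    ("Public", "Sensitive"),              # under-classified: the risky mistake
    ("Business Confidential", "Public"),  # over-classified: the safer mistake
    ("Public", "Public"),
    ("Sensitive", "Sensitive"),
]
print(error_rates(sample))
print(within_tolerance(sample, max_under=0.01, max_over=0.30))
```

Setting a stricter limit for under-classification than for over-classification reflects the asymmetry described above.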

Four tools you might use to identify documents for each classification include: keyword search, metadata search, regular expression search, and supervised learning models. 

We recommend layering these search technologies to achieve the desired classification.

  • Keyword Search: Keyword search can help retrieve documents containing specific language; for instance, language that you know implicates PII, represents a public-facing document like a press release, or identifies standardized documents like routine financial reports.
  • Metadata Search: Metadata search can help home in on documents or communications from certain sources, file path locations, or even custodians that are likely to require a protected classification; it can also be used to search document titles for known formats or conventions used in routine reports or contracts.
  • Regular Expression Search: Regular expression searching allows you to search for patterns of numbers and letters, such as Social Security numbers, bank account numbers, form data, phone numbers, and more. Regular expression search is an important tool in identifying PII and PHI in organizational data.
  • Supervised Learning: Supervised learning models allow users to provide examples of the types of documents they would like to identify, along with examples of documents that are not representative of those types. A learning model is created based on the examples provided and can then be applied across all organizational data. Models can be built for a data type as narrow as Accounts Payable Contracts or as broad as one of the four main data sensitivity classifications mentioned above. The results can be refined to meet the margin of error most suitable for your task. We recommend using supervised learning tools in conjunction with other search tools to build out a dynamic classification search for each classification.
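
As a rough illustration of layering, the sketch below combines a regular expression, keyword lists, and a file path check into one rule-based classifier. Everything in it is hypothetical: the pattern, the keywords, the path hints, and the choice to fall back to Business Confidential when nothing matches. A supervised learning model could be added as one more layer, but it is left out here to keep the example short.

```python
import re
from dataclasses import dataclass

# Labels taken from the classification list earlier in this post.
PUBLIC = "Public"
BUSINESS_CONFIDENTIAL = "Business Confidential"
LEGAL_IP = "Sensitive – Legal/IP"
PERSONAL = "Sensitive – Personal Information"

# Hypothetical search criteria; replace with terms validated on your own data.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")    # regular expression layer
PII_KEYWORDS = ("date of birth", "social security")   # keyword layer
LEGAL_KEYWORDS = ("attorney-client privilege", "trade secret")
PUBLIC_KEYWORDS = ("press release", "for immediate release")
LEGAL_PATH_HINTS = ("/legal/", "/contracts/")          # metadata layer


@dataclass
class Document:
    path: str  # file path metadata
    text: str  # extracted body text


def contains_any(haystack, needles):
    """True if any of the needles occurs as a substring of haystack."""
    return any(needle in haystack for needle in needles)


def classify(doc):
    """Check the most sensitive criteria first so they win when layers disagree."""
    text = doc.text.lower()
    path = doc.path.lower()
    if SSN_PATTERN.search(text) or contains_any(text, PII_KEYWORDS):
        return PERSONAL
    if contains_any(path, LEGAL_PATH_HINTS) or contains_any(text, LEGAL_KEYWORDS):
        return LEGAL_IP
    if contains_any(text, PUBLIC_KEYWORDS):
        return PUBLIC
    # Assumption: unmatched documents default to the safer non-public label.
    return BUSINESS_CONFIDENTIAL


print(classify(Document("/shared/hr/form.txt", "Employee SSN: 123-45-6789")))
```

Because the sensitive checks run first, a press release that happens to contain a Social Security number is still classified as Sensitive – Personal Information rather than Public.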

Evergreen Classifications 

You’ve done the work to classify your existing data; now let’s look forward. New data is constantly being created, almost automatically. Information governance tools like Rational Governance don’t stop at classifying pre-existing data; they also allow you to apply classifications to data on a go-forward basis with Evergreen Classification. Evergreen Classifications are essentially persistent searches that stay up to date: any new data meeting your data sensitivity search criteria receives the appropriate sensitivity classification.
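
Conceptually, a persistent search can be pictured as a saved set of criteria that is re-applied whenever a new document arrives. The sketch below shows that idea only; it is not the Rational Governance API, and the `EvergreenClassifier` class, its `ingest` method, and the sample criteria are all hypothetical.

```python
class EvergreenClassifier:
    """Hypothetical persistent search: saved criteria applied to each new document."""

    def __init__(self, saved_searches, default_label):
        # saved_searches: list of (label, predicate) pairs, most sensitive first.
        self.saved_searches = list(saved_searches)
        self.default_label = default_label
        self.labels = {}  # doc_id -> classification

    def ingest(self, doc_id, text):
        """Classify a newly created document as soon as it is seen."""
        for label, matches in self.saved_searches:
            if matches(text):
                self.labels[doc_id] = label
                break
        else:
            # No saved search matched, so fall back to the default label.
            self.labels[doc_id] = self.default_label
        return self.labels[doc_id]


classifier = EvergreenClassifier(
    saved_searches=[
        ("Sensitive – Personal Information", lambda t: "date of birth" in t.lower()),
        ("Public", lambda t: "press release" in t.lower()),
    ],
    default_label="Business Confidential",
)
print(classifier.ingest("doc-1", "Patient date of birth: 01/02/1980"))
print(classifier.ingest("doc-2", "Quarterly planning notes"))
```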

To make sure your classifications continue to accurately reflect changing organizational data, be sure to revisit your dynamic classification searches once a year. Sample new results and unclassified data to refine your supervised learning models and search criteria as necessary.
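
For the periodic review, one simple approach is to draw a random sample from each classification, including unclassified documents, and hand it to reviewers. The sketch below assumes documents arrive as (doc_id, label) pairs with `None` marking unclassified data; the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def review_sample(labeled_docs, per_label, seed=None):
    """Draw up to per_label random document ids from each classification.

    labeled_docs: iterable of (doc_id, label); label is None for unclassified data.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    by_label = defaultdict(list)
    for doc_id, label in labeled_docs:
        by_label[label].append(doc_id)
    return {
        label: rng.sample(ids, min(per_label, len(ids)))
        for label, ids in by_label.items()
    }


docs = [(f"doc-{i}", "Public") for i in range(10)]
docs += [(f"new-{i}", None) for i in range(3)]
picked = review_sample(docs, per_label=5, seed=42)
print({label: len(ids) for label, ids in picked.items()})
```

Sampling each classification separately keeps small but high-risk categories from being crowded out by the much larger volume of routine documents.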

Identifying and classifying sensitive data is not instant, but at the same time, it is more straightforward than many businesses realize. All it takes is awareness of a company’s data landscape, a planned approach to classifying data with appropriate sensitivity levels, and the right technology in place. To get started with sensitive data classification, contact us today. 

Sarah Cole

About The Author

Senior Director of Consulting
Sarah Cole serves as Senior Director of Consulting at Rational Enterprise. As a technologist and eDiscovery veteran, Sarah is best known for her ardent advocacy of predictive and textual analytics in the legal technology space.