In the context of records management, autoclassification is essentially a programmable if/then statement. Many vendors tend only to emphasize the efficiencies of the “auto” part; namely, that it is transformative for a records management program to automatically categorize electronic records. However, these vendors often do not give a fair impression of the front-end work required to actually achieve automation, or in other words, to set up those if/then statements.
In fact, many practitioners have been led to believe that autoclassification is as straightforward as plugging a technology into a data source or loading data into a dedicated application. After that, “artificial intelligence” automatically categorizes the data while everyone gets to go on vacation. Vendors who make claims like these have given autoclassification a bad name. For the interested practitioner, it is worth taking a more practical look at autoclassification: what it is, and how much work is involved before you see a return on investment.
What are the Prerequisites for Autoclassification?
When it comes to data classification, especially unstructured data classification, automation has the same requirements as in any other discipline. Programmatically building your organization’s if/then statements and coding exemplar document sets for each must take place at the outset and must be done correctly for automation to work as intended. If those inputs are not up to standard, then instead of replicating value at scale, all you will accomplish is replicating mistakes at scale.
If the promise of autoclassification is still attractive to you, we have written a series of posts that will help you understand the fundamentals, some particularly useful technologies, how to think about ROI, and how to go about measuring it.
When it comes to data classification, we can effortlessly imagine communicating rules to other human beings:
“When you receive an invoice, give it this tag in your email.
Once it gets paid, put it in this folder.
3 years after it has been paid, you should delete it, unless it’s on legal hold, then don’t delete it.”
Nothing about the language itself is particularly complex. However, relying on humans to remember all those ifs (i.e., invoice received, invoice paid, invoice paid plus three years, invoice on legal hold, etc.) and thens (what they are supposed to do in all those situations) over the course of multiple years in differing circumstances can be problematic. When you consider the number of times humans need to remember and apply those actions reliably, you get a quick and visceral understanding of why manual records management has a difficult time keeping up with modern data growth.
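To make the point concrete, the spoken rules above can be written down as an explicit if/then sequence. The following is only a minimal sketch: the `Invoice` fields, the action labels, and the `next_action` helper are hypothetical names chosen for illustration, and the three-year period is taken from the example rather than from any actual retention schedule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Invoice:
    received: date
    paid: Optional[date] = None   # None until payment is recorded
    legal_hold: bool = False


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def next_action(invoice: Invoice, today: date, retention_years: int = 3) -> str:
    """Walk the ifs in order and return the matching then."""
    if invoice.paid is None:
        return "tag as invoice"
    if today < add_years(invoice.paid, retention_years):
        return "file in paid folder"
    if invoice.legal_hold:
        return "retain (legal hold)"
    return "delete"
```

Writing the rules this way does not make them smarter. It only means the same checks run the same way every time, no matter how many invoices arrive or how many years pass.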
One familiar example of automated if/then technology is the feature set built into many email clients. For instance, a system may automatically create calendar events using unstructured text contained in the body of an email. Another example is Out of Office notifications: if mail is received within a specified date range, the system can automatically send a form message in response to the sender.
Autoclassification essentially proposes taking the onus of the if/then logic off of the human brain and applying it programmatically, thus introducing scalability and reliability to a process that desperately needs both. A simplistic version of this automated if/then classification can also be found in email clients, namely foldering: the ability to set a rule that checks the To: field of an incoming email and, if a certain value is found, automatically places the email in a certain folder in the mailbox. Autoclassification is a powerful set of tools that tries to accomplish the same kind of efficiency gains in the realm of records management generally.
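A foldering rule of this kind might look roughly like the sketch below, which uses Python's standard `email` package to read the To: field. The addresses, folder names, and the `choose_folder` function are assumptions made up for illustration; real mail clients implement their rules in their own ways.

```python
from email.message import EmailMessage
from email.utils import getaddresses

# Hypothetical rule table: recipient address -> destination folder.
FOLDER_RULES = {
    "billing@example.com": "Invoices",
    "legal@example.com": "Legal",
}


def choose_folder(msg: EmailMessage, rules=FOLDER_RULES, default="Inbox") -> str:
    """Return the folder for the first rule whose address appears in To:."""
    to_headers = [str(value) for value in msg.get_all("To", [])]
    recipients = {addr.lower() for _name, addr in getaddresses(to_headers)}
    for address, folder in rules.items():
        if address in recipients:   # the "if"
            return folder           # the "then"
    return default
```

For example, a message addressed to `Billing Team <Billing@Example.com>` would be routed to the hypothetical `Invoices` folder, while a message with no matching recipient stays in `Inbox`.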
One of the most important factors that determines the success of autoclassification technologies is what parameters can be used to articulate the if. In other words, what you are able to tell the technology to look for that will then inform a certain trigger or enforcement action.
First, imagine you are talking to a human, and you instruct that person that once an invoice has been paid, it needs to go in a specific folder in a particular SharePoint site associated with the client who paid the invoice. Now, imagine trying to automate that process. What attributes would the technology need to decipher? It would need to be able to tell the difference between an invoice and any other document on the system, and specifically between accounts receivable invoices and accounts payable ones. It must also understand the difference between when an invoice has been paid and when it has not, and determine the client with which it is associated.
Apart from document attributes, in order to detect a relevant document in the first place, the autoclassification technology needs permission to access its initial storage location. Where a document is stored will also have a significant impact on how many of those ifs will be readily available for an autoclassification technology to leverage.
In the example above, if the invoice is sitting in a dedicated billing system, then much of the relevant information will likely be stored as metadata. If Client Name is a defined field, the autoclassification system has a key to understand with certainty where the invoice is supposed to be stored in SharePoint. Even though the client’s name may be in the document itself, it is much more difficult to find it as a loose word in the document than if it is pre-populated as a value in a dedicated field.
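The difference between a defined field and a loose word can be sketched in a few lines. This is an illustration only: the `Client Name` key follows the example above, while `KNOWN_CLIENTS`, the client names, and `resolve_client` are hypothetical, and real engines use far richer matching than a simple pattern search.

```python
import re
from typing import Optional, Tuple

# Hypothetical list of clients the organization already knows about.
KNOWN_CLIENTS = ["Acme Corp", "Globex"]


def resolve_client(metadata: dict, body_text: str) -> Tuple[Optional[str], str]:
    """Return (client, how it was found), preferring the dedicated field."""
    client = metadata.get("Client Name")
    if client:
        # A populated field gives a definite answer with no guessing.
        return client, "metadata"
    # Fallback: hunt for a known name as a loose phrase in the text.
    for name in KNOWN_CLIENTS:
        if re.search(rf"\b{re.escape(name)}\b", body_text, re.IGNORECASE):
            return name, "text match"
    return None, "unresolved"
```

The metadata branch is a single lookup. The fallback depends on a maintained list of names and can still miss misspellings or abbreviations, which is why the storage location of a document matters so much.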
Now imagine instead that the invoice is created in a word processing application, turned into a PDF, printed, sent to the client in hardcopy, and the client simply makes a payment to the remittance contact referencing the invoice number. The autoclassification technology has a lot less to go on, since the ifs are not as obvious, and they exist in disparate systems. However, these types of nuances can be learned by the autoclassification engine (through training) to ensure that documents are treated appropriately.
Just as important as the ability to recognize an if is the system’s ability to execute a then. Some examples of then transformations include:
- Add/remove additional metadata
- Move/copy the file
- Delete the file
- Enforce the retention or preservation of a file
- Create a notification
- Start a workflow
- Add encryption
- Change permissions
- Compile a report
- Alert other applications
These transformations can be standalone or combined with others. In essence, any action a human would take on a record should be considered a possible then for the autoclassification technology to enforce. In a mature and flawless information management program, this is a long and complex list indeed.
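One way to picture standalone and combined thens is as small, reusable actions attached to a rule. The sketch below assumes a document is just a dictionary; `Rule`, `add_metadata`, `move_to`, and `notify` are hypothetical names, not the API of any particular product.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Action = Callable[[Document], None]


def add_metadata(key: str, value: Any) -> Action:
    def act(doc: Document) -> None:
        doc.setdefault("metadata", {})[key] = value
    return act


def move_to(folder: str) -> Action:
    def act(doc: Document) -> None:
        doc["location"] = folder
    return act


def notify(recipient: str, outbox: List[tuple]) -> Action:
    def act(doc: Document) -> None:
        # Stand-in for a real notification: record who to tell and about what.
        outbox.append((recipient, doc.get("location")))
    return act


@dataclass
class Rule:
    condition: Callable[[Document], bool]   # the "if"
    actions: List[Action]                   # one or more "thens", run in order

    def apply(self, doc: Document) -> bool:
        """Run every action if the condition holds; report whether it fired."""
        if not self.condition(doc):
            return False
        for action in self.actions:
            action(doc)
        return True
```

A rule for paid invoices could then combine `add_metadata("status", "paid")`, `move_to("Clients/Acme/Paid")`, and a notification in a single list. Because the actions run in order, the sequence matters whenever one action depends on another.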
Where the documents in question are meant to reside in the long term is just as important as where the documents are initially stored. If the technology can take all of the above actions but only within a single application, the technology has limited utility if you store data in many sources.
In the next article, we will examine some specific autoclassification technologies and discuss their strengths and weaknesses, along with commentary on how to approach evaluating them.