Over the past few years, we’ve heard a growing number of complaints from clients about duplicates in their hosting and review databases. When we look into the matter further, we find that these “duplicates,” although substantively the same content, are not exact copies of each other from an evidentiary perspective. For example, migrating email to an archiving system might cause slight changes to the formatting of the header, and as a result the processing platform no longer views the emails as duplicates.
We are seeing an increasing need for defensible, technology-based deduplication that goes beyond MD5 hashing. The industry standard is to use the MD5 hash to identify duplicates of a document that has already been processed and suppress them, saving clients the time, money, and headache of reviewing duplicative data. This worked fairly well for a long time, but as email and archive systems have continued to evolve, the MD5 hash is no longer as effective at identifying duplicates as it once was. I want to lay out some of the business practices that are causing challenges for MD5-based deduplication and propose that we start using the Message-ID, in combination with other metadata, as a supplement to it.
Culprit 1: URL Rewriting
We’ve all seen the typical phishing email. It is usually formatted in a non-standard way, or it requests an invoice or documents we’ve never heard of, or it comes from a person we know wouldn’t be sending or requesting this information. Those are the obvious ones. Every year we see them, all slightly different but always evolving to appear authentic. Phishing scams typically disguise hyperlinks so that they look legitimate on the surface but send users to a malicious site. One way to catch this bait and switch is to hover over the hyperlink and see whether it points to the location you expect. One example I have seen is an email or text, supposedly from my bank, asking whether I authorized a charge and including a link to log in and approve that charge. When I look at the link it wants me to click, however, it has nothing to do with my bank.
A security feature you may not even be aware of that is working hard in the background is URL rewriting. IT security professionals created this as a mechanism to catch these attacks before users can click on the malicious link. When an email is received, the URLs in the body of the email are rewritten. When a user clicks the link, it passes through a service where it is analyzed for security before the user is sent on to the original link location. If the link is determined to be bad, the site is blocked. While this is an excellent technology for thwarting malicious attacks, URL rewriting can produce a different hyperlink in each copy of an email, so each copy generates a different hash.
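To make the effect concrete, here is a minimal Python sketch of why rewritten links defeat a content hash. The gateway host, the rewritten URL format, and the per-recipient parameter are all hypothetical; real security products use their own schemes, and this is only an illustration of the general behavior described above.

```python
import hashlib
import re
from urllib.parse import quote

# Hypothetical scanning gateway; real products use their own hosts and formats.
GATEWAY = "https://urlscan.example.net/v1"


def rewrite_urls(body: str, recipient: str) -> str:
    """Wrap every http(s) link in a gateway URL tagged with the recipient (assumed scheme)."""
    def wrap(match: re.Match) -> str:
        target = quote(match.group(0), safe="")
        return f"{GATEWAY}?u={target}&r={quote(recipient, safe='')}"
    return re.sub(r"https?://\S+", wrap, body)


def md5_hex(text: str) -> str:
    """MD5 of the text, as a processing platform might compute over an email body."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


body = "Please review the invoice: https://example.com/invoices/123"
copy_alice = rewrite_urls(body, "alice@example.com")
copy_bob = rewrite_urls(body, "bob@example.com")

# Same message as sent, three different hashes once the links are rewritten.
for label, text in [("sender", body), ("alice", copy_alice), ("bob", copy_bob)]:
    print(label, md5_hex(text))
```

Even a one-character difference in the body changes the MD5 value completely, so the three copies above would all survive hash-based deduplication.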
Culprit 2: External Email Warnings
Another feature used to protect against phishing emails is the external email warning. Attackers will craft an email to look like it is coming from a trusted associate or organization. When a warning at the top of the email states that it was received from an external source, readers know that the email did not originate from inside the organization before they click any links. This external warning message will also make an email unique when hashing. Because the MD5 hash takes into account the body and header of an email during processing, the system treats this email as a unique document even if the only difference between it and the sender’s copy is “External” at the top of the email. This is not to say you shouldn’t implement these policies, but many companies have instituted them wholesale and may not have thought about the discovery implications for review cost and hosting size, never mind the frustration of having to review multiple copies of the same document.
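The sketch below shows the same problem for a warning banner, using Python's standard email library. The banner wording and the choice of fields fed into the hash (From, To, Subject, Date, and body) are assumptions for illustration; processing platforms each define their own hashing recipe.

```python
import hashlib
from email.message import EmailMessage

# Hypothetical banner text; real gateways use their own wording.
BANNER = "[EXTERNAL] This email originated outside the organization.\n\n"


def build_message(body: str) -> EmailMessage:
    """Build a simple plain-text email with fixed header values."""
    msg = EmailMessage()
    msg["From"] = "vendor@example.org"
    msg["To"] = "alice@example.com"
    msg["Subject"] = "Q3 invoice"
    msg["Date"] = "Mon, 06 Jan 2025 09:30:00 -0500"
    msg.set_content(body)
    return msg


def dedup_hash(msg: EmailMessage) -> str:
    """MD5 over selected header fields plus the body (an assumed recipe)."""
    fields = [str(msg[name]) for name in ("From", "To", "Subject", "Date")]
    fields.append(msg.get_content())
    return hashlib.md5("\n".join(fields).encode("utf-8")).hexdigest()


text = "Hi Alice, the Q3 invoice is attached."
sender_copy = build_message(text)
recipient_copy = build_message(BANNER + text)  # gateway prepends the warning

print(dedup_hash(sender_copy))
print(dedup_hash(recipient_copy))
print(dedup_hash(sender_copy) == dedup_hash(recipient_copy))
```

Every header field matches between the two copies, yet the prepended banner alone is enough to produce a different hash.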
Culprit 3: Emails Rebranded
In addition to phishing protections, email marketing strategies are another source of email duplicates. A prominent marketing innovation companies have started using is centralized signature management. This allows a company to customize signatures for marketing campaigns by adding or omitting information, promotions, content, etc. from a centralized third-party platform. These signatures are not integrated directly into the user’s email account but are applied when the email is sent. The sender’s copy doesn’t have the signature, but a recipient’s copy does, even within the same organization. The two versions therefore hash as unique documents solely because the signature block differs, even though they are ostensibly the exact same document. Consider how this might affect your electronic discovery obligations and timelines if you must review multiple copies of the same document. It can quickly become a rapidly growing headache for companies.
What Can We Do About It?
Acorn has been working behind the scenes to support our clients in the never-ending duplication struggle. We have developed custom proprietary tools to analyze email data and identify duplicates using a combination of the Message-ID and additional metadata fields. Message-ID is a property added to outgoing emails by the sending mail system. It is retained in the recipient’s copy of the email as well. It isn’t always unique, and it isn’t always populated, which is why it shouldn’t be the only method of deduplication. Used in association with other metadata, however, this property can identify duplicative emails. In one project alone, we were able to reduce the review population by 15%. This translated into a savings of roughly $40,000 by reducing the number of documents to review.
I’m not here to argue the efficacy of using the Message-ID and metadata for deduplication purposes; I will leave that to the attorneys. What I can verify is that the number of duplicates in email collections that need to be managed is increasing.
Acorn is a legal data consulting firm that specializes in AI and Advanced Analytics for litigation applications while providing rigorous customer service to the eDiscovery industry. Acorn primarily works with large regional, midsize national, and boutique litigation firms. Acorn provides high-touch, customized litigation support services with a heavy emphasis on seamless communications. For more information, please visit www.acornls.com.
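As a rough illustration of the general idea, and not a description of Acorn's proprietary tools, the Python sketch below groups documents by Message-ID combined with sender and sent time, and falls back to the MD5 hash when the Message-ID is missing. The record layout, field names, and the particular metadata fields chosen are all assumptions.

```python
import hashlib
from collections import defaultdict


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_key(record: dict) -> tuple:
    """Message-ID plus metadata when available, otherwise the content hash."""
    message_id = (record.get("message_id") or "").strip().strip("<>")
    if not message_id:
        # Message-ID not populated: only exact copies will group together.
        return ("md5", record["md5"])
    # Message-ID alone is not guaranteed unique, so pair it with metadata.
    return ("mid", message_id, record["sender"].strip().lower(), record["sent_utc"])


def group_duplicates(records: list) -> list:
    """Return lists of doc_ids that share a dedup key, in first-seen order."""
    groups = defaultdict(list)
    for record in records:
        groups[dedup_key(record)].append(record["doc_id"])
    return list(groups.values())


# Hypothetical records: DOC-001 is the sender's copy, DOC-002 the recipient's
# copy with a warning banner, DOC-003 has no Message-ID, and DOC-004 reuses
# the same Message-ID with a different sender and sent time.
records = [
    {"doc_id": "DOC-001", "message_id": "<abc123@mail.example.org>",
     "sender": "vendor@example.org", "sent_utc": "2025-01-06T14:30:00Z",
     "md5": md5_hex("Invoice attached.")},
    {"doc_id": "DOC-002", "message_id": "<abc123@mail.example.org>",
     "sender": "Vendor@Example.org", "sent_utc": "2025-01-06T14:30:00Z",
     "md5": md5_hex("[EXTERNAL]\n\nInvoice attached.")},
    {"doc_id": "DOC-003", "message_id": None,
     "sender": "vendor@example.org", "sent_utc": "2025-01-06T14:30:00Z",
     "md5": md5_hex("Scanned memo.")},
    {"doc_id": "DOC-004", "message_id": "<abc123@mail.example.org>",
     "sender": "other@example.net", "sent_utc": "2025-02-01T09:00:00Z",
     "md5": md5_hex("Unrelated message.")},
]

print(group_duplicates(records))
```

In this sketch the first two documents group together despite their different hashes, while the reused Message-ID on DOC-004 does not cause a false match because the sender and sent time differ. Which metadata fields to pair with the Message-ID is a judgment call that would need to be validated against the data and agreed with counsel.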