Finding PII and PHI: Three Ways AI Leads to Better Outcomes

Published on February 20, 2020

Keith Wilson

As a transplant from eDiscovery software to eDiscovery service, Keith brings a deep understanding of automation to eDiscovery process design. He has worked with Fortune 500 and AmLaw 100 clients to gain efficiencies in the redaction process and alternative eDiscovery workflows.

Because of increasing data volumes in eDiscovery, there’s more Personally Identifiable Information (PII) and Protected Health Information (PHI) than ever before. That data needs to be identified so it can be isolated, redacted or deleted. Traditionally, in litigation settings, that’s been a monumental task.

With so much data, litigators and their end-clients have historically had only two options for identifying where PII might be: either spend a lot of time and money reviewing every document that will be produced for potential redactions, or rely on regular expressions (RegEx) and take a calculated risk, accepting that while RegEx will catch most PII/PHI, some may inadvertently get produced or put in the public record.

Neither option was good. Either litigators and their end-clients had to spend a ton of cash for a comprehensive review, or they had to accept that part of the cost was a high likelihood of reputation damage due to the inadvertent production of PII.

But now, with AI, there’s a new option. And it’s time to reevaluate how you think about the economics of redaction. RegEx is no longer the best way to get the outcomes you need—AI is. And while it can seem like a black box, it’s actually a much less risky approach that can give you a holistic view of relevant data.

With AI, you can efficiently cross-check what needs to be redacted, what your review process has missed, or what has been redacted by mistake. Now, I’ll explain how traditional RegEx software can fail, and provide some concrete examples of how AI-enabled software can provide a solution.
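The cross-check described above can be pictured as a comparison between two sets of spans: what a model flagged and what reviewers actually redacted. The following is only a minimal sketch under stated assumptions. The `Span` type, the `cross_check` function, and the sample document IDs are all hypothetical, and real tools would likely match overlapping spans rather than requiring exact equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A character range inside one document (hypothetical structure)."""
    doc_id: str
    start: int
    end: int


def cross_check(model_flagged, reviewer_redacted):
    """Compare model output with human redactions using exact span matching."""
    return {
        # Flagged by the model but never redacted: possible misses.
        "possibly_missed": model_flagged - reviewer_redacted,
        # Redacted by a reviewer but not flagged: possible over-redaction.
        "possibly_over_redacted": reviewer_redacted - model_flagged,
        # Both agree.
        "agreed": model_flagged & reviewer_redacted,
    }


# Made-up sample data for illustration only.
model_flagged = {Span("DOC-001", 10, 21), Span("DOC-002", 40, 52)}
reviewer_redacted = {Span("DOC-001", 10, 21), Span("DOC-003", 5, 10)}

report = cross_check(model_flagged, reviewer_redacted)
for bucket, spans in report.items():
    print(bucket, sorted(spans, key=lambda s: (s.doc_id, s.start)))
```

Neither bucket is a verdict. Each one is a short list for a person to look at, which is far cheaper than reviewing every document again.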

1. AI Can Evaluate Context to Identify PII That Does Not Follow the Typical RegEx Format

RegEx has always been an efficient way to find PII that is in a standard format. It’s often used to find phone numbers, SSNs, addresses and more. The problem with finding PII with RegEx is that the data needs to be in that exact format to be found.
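To make the format problem concrete, here is a small sketch using Python's standard `re` module. The pattern is one common way to write an SSN expression, not the pattern any particular product uses, and the sample strings are made up.

```python
import re

# A typical SSN pattern: three digits, two digits, four digits, hyphen-separated.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

samples = [
    "SSN: 123-45-6789",     # standard format
    "SSN: 123 45 6789",     # spaces instead of hyphens
    "social is 123456789",  # no separators at all
    "SSN 123-45-67 89",     # stray space, e.g. from OCR
]

matches = [bool(SSN_PATTERN.search(s)) for s in samples]
for text, found in zip(samples, matches):
    print(f"{found!s:5}  {text}")
```

Only the first string matches. You can keep widening the pattern to cover each new variation, but every loosening also raises the odds of matching numbers that are not SSNs at all.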

Take addresses, for example. Addresses vary so much in how they are formatted that it becomes practically impossible to write a RegEx that will find all of them. An AI-enabled software program can solve that problem, though. Instead of having to list out every possible address format, we can train the AI starting from the standard RegEx, showing it what is an address and what isn’t based on what’s actually written and what context surrounds it.

If you’ve ever typed an address into Google, you know what I’m talking about. How many times have you typed some combination of numbers and street names into Google and had the search engine’s AI figure out exactly where you were talking about, misspellings or abbreviations and all?

With Entity Modeling, a feature of the NexLP software that builds on the same AI we use for technology-assisted review (TAR) and applies it to PII, you can train different models on different pieces of PII. For any given case, you can apply only the models that correspond to the regulations you need to follow for that case.
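One way to picture "apply only the models that correspond to the regulations" is a simple lookup from regulation to entity models. This is a hypothetical sketch, not NexLP's configuration or API. The registry contents and model names are invented for illustration and are not legal guidance on what any regulation covers.

```python
# Hypothetical registry: regulation label -> entity models to run.
# The entries are illustrative assumptions only.
MODELS_BY_REGULATION = {
    "HIPAA": {"patient_name", "medical_record_number", "date_of_birth"},
    "GDPR": {"person_name", "home_address", "email_address"},
}


def models_for_case(regulations):
    """Return the sorted union of models for the regulations in play."""
    selected = set()
    for regulation in regulations:
        if regulation not in MODELS_BY_REGULATION:
            # Fail loudly rather than silently skipping a regulation.
            raise KeyError(f"No models registered for {regulation!r}")
        selected |= MODELS_BY_REGULATION[regulation]
    return sorted(selected)


hipaa_only = models_for_case(["HIPAA"])
both = models_for_case(["HIPAA", "GDPR"])
print(hipaa_only)
print(both)
```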

Once NexLP’s PII modeling AI knows what it’s looking for, it can search for documents that may contain that PII, identifying the information you are looking for in all its various forms.

2. AI Can Prevent Over-redaction Risk, Where Names Are Also Commonly Used Words (e.g., Mr. Brown and Brown the Color)

Searching for names and their variations has always been the standard approach when redacting personal names or product names. The problem comes into play when the name is also an ordinary word. Brown is sometimes a last name and thus may need to be redacted as PII. But “brown” can also describe the color of a coat or the step of browning onions in a recipe, and those uses don’t fall under PII and don’t need to be redacted.

Under-redacting can cause problems with regulations and rules, but over-redacting is also not ideal. It can get you into trouble with your opposing counsel or the court, and can present an expensive risk. Standard software can confuse those uses, and an eyeballs-on-documents approach requires significant resources. Context-aware AI can sort through those different contexts like a human, but much faster and much cheaper.
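Here is a toy illustration of why context matters for a word like "Brown". It uses one hand-written rule (is the word capitalized and preceded by an honorific?) as a stand-in for the many context signals a trained model would learn. The function name and the rule are assumptions for illustration, and a real model would be far more nuanced.

```python
import re

HONORIFICS = {"mr", "mrs", "ms", "dr"}


def label_occurrences(text, surname="Brown"):
    """Label each occurrence of `surname` as 'name' or 'word' from its context."""
    tokens = re.findall(r"[A-Za-z]+", text)
    labels = []
    for i, token in enumerate(tokens):
        if token.lower() != surname.lower():
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        # Toy rule: treat it as a name only when capitalized and after an honorific.
        is_name = token[0].isupper() and previous in HONORIFICS
        labels.append("name" if is_name else "word")
    return labels


print(label_occurrences("Mr. Brown signed the lease."))
print(label_occurrences("She wore a brown coat."))
print(label_occurrences("Brown the onions, then call Dr. Brown."))
```

A plain keyword search for "brown" would redact all four occurrences. Even this crude rule separates the two names from the coat and the onions, which is the kind of distinction a context-aware model is meant to make at scale.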

3. AI Can Help You Find PII That You Might Not Know Is in the Document Set

There are times that you will need to redact anyone not relevant to the litigation, but you don’t have a reliable master list to reference. Or you need to comply with a law that requires you to anonymize the names of students or other names, but no comprehensive list of names exists. What are your options?

Historically, you were looking at a long and tedious manual review, or trying to custom-program existing software. An AI-enabled solution can automatically create a master list of names. The AI model combs the document set, using the context it’s been trained on to identify the right names.
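As a rough picture of what "automatically create a master list of names" involves, the sketch below scans a few made-up documents and tallies candidate names. It substitutes a simple honorific pattern for a trained model, so it is only an assumption-laden illustration: the pattern, the function name, and the sample documents are all hypothetical, and it would miss any name that appears without a title.

```python
import re
from collections import Counter

# Capture one or two capitalized words following a title such as "Mr." or "Dr."
NAME_AFTER_TITLE = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)"
)


def build_name_list(documents):
    """Tally candidate names across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(NAME_AFTER_TITLE.findall(text))
    return counts


# Made-up sample documents for illustration only.
documents = {
    "DOC-001": "Mr. Brown signed the lease. Ms. Alice Nguyen witnessed it.",
    "DOC-002": "Please forward the brown folder to Dr. Patel.",
    "DOC-003": "Ms. Alice Nguyen called Mr. Brown about the lease.",
}

name_list = build_name_list(documents)
for name, count in sorted(name_list.items()):
    print(f"{name}: {count}")
```

The output is a candidate list for a person to confirm, not a final redaction list. Note that the "brown folder" in the second document is correctly left out.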

This way, instead of reviewing every document yourself, your trained AI model will do that work for you, giving you a holistic view of the exact information you need.

Final Thoughts

I’ve had conversations about redactions on everything from high-profile cases to DSARs, with people from corporations, law firms and most vendors in the space. Redactions are expensive, and probably every reviewer’s least favorite task. I’ve seen a dramatic switch in how people are thinking about PII/PHI.

I’ve mentioned some great new ways to find PII/PHI above, but I think the people doing the work and the process they follow are just as important. Some of the workflows above, if used incorrectly, could lead you to produce PII/PHI.

I’d love to hear if you have other workflows to find PII/PHI.



About Acorn 

Acorn is a legal data consulting firm that specializes in AI and Advanced Analytics for litigation applications, while providing rigorous customer service to the eDiscovery industry. Acorn primarily works with large regional, midsize national and boutique litigation firms. Acorn provides high-touch, customized litigation support services with a heavy emphasis on seamless communications. For more information, please visit