Predictive Coding Primer

What is Predictive Coding?

The easiest way to describe predictive coding is to state clearly what it is not: it is not a replacement for human attorneys examining information during discovery. Predictive coding (also known as technology-assisted review) relies heavily on statistics to reduce the time needed to examine large document sets that in the past might have taken a group of attorney reviewers weeks or months to review at great cost. The focus in predictive coding is to gather a representative sample (or "seed") set from a large data population and carefully "train" the technology using input from experienced, knowledgeable attorneys on which documents are "responsive" or "non-responsive" to the issues in a particular case. Following this training process, the technology takes the decisions made on the sample set and applies them to the entire data set, generating significant time savings. While it is still early in the game, it is clear that predictive coding will be an important tool in the eDiscovery toolbox, particularly for matters involving large amounts of data.
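The train-then-apply loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: commercial predictive coding tools use their own, typically more sophisticated, methods, and the word-count (naive Bayes) classifier here is a stand-in chosen for brevity. The function names `train` and `predict` and the four sample documents are hypothetical.

```python
import math
from collections import Counter

LABELS = ("responsive", "non-responsive")

def train(seed_set):
    """Count words per label from attorney-coded (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the label with the higher naive Bayes score for one document."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        # Add-one smoothing keeps unseen labels and words from producing log(0).
        score = math.log((doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical seed set coded by an attorney familiar with the matter.
seed_set = [
    ("merger pricing agreement with supplier", "responsive"),
    ("pricing terms for the merger contract", "responsive"),
    ("lunch menu for friday", "non-responsive"),
    ("office party on friday", "non-responsive"),
]
model = train(seed_set)

# Apply the seed-set decisions to documents no attorney has reviewed yet.
for doc in ["revised merger pricing", "friday lunch order"]:
    print(doc, "->", predict(model, doc))
```

In practice the seed set would contain many more documents, and the predictions would be checked and refined over several training rounds rather than accepted after a single pass.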

Benefits of Predictive Coding

In some cases, experienced document reviewers can review 100 or more documents per hour, which means a data population of 500,000 documents would take a team of 20 such reviewers 250 hours each to complete, or 5,000 reviewer-hours in total (slightly more than six standard 40-hour workweeks). Once seed sets are established and the system is trained, predictive coding technology can process the same 500,000 documents in a significantly shorter time, allowing for a more targeted review of only the documents most likely to be responsive.
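The arithmetic behind the linear-review figures above can be checked with a short calculation. The function name `review_hours` is illustrative; the inputs are the figures from the text.

```python
def review_hours(total_docs, reviewers, docs_per_hour):
    """Return (hours per reviewer, total reviewer-hours) for a linear review."""
    total_reviewer_hours = total_docs / docs_per_hour
    hours_per_reviewer = total_reviewer_hours / reviewers
    return hours_per_reviewer, total_reviewer_hours

per_reviewer, total = review_hours(500_000, 20, 100)
workweeks = per_reviewer / 40  # standard 40-hour workweeks per reviewer

print(per_reviewer)  # 250.0 hours for each of the 20 reviewers
print(total)         # 5000.0 reviewer-hours across the team
print(workweeks)     # 6.25 workweeks
```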

While human attorney review is the accepted practice for legal review, some studies show that recall may be as low as 50%-70% with traditional attorney review, meaning that the attorney reviewers identified only 50%-70% of all relevant documents within the entire data population. Technology-based systems do not suffer from the human conditions of distractibility, loss of focus, or inconsistent decision-making, and studies from the TREC Report in 2011 showed recall rates with predictive coding to be higher than those of human reviewers.
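Recall, as defined above, is the share of all relevant documents that a review actually finds. A minimal sketch of the calculation follows, together with precision (the share of flagged documents that are truly relevant), which is the usual companion measure. The document IDs are made up for illustration, and in a real matter the full set of relevant documents is not known and has to be estimated from samples.

```python
def recall_and_precision(flagged, relevant):
    """Compare the documents a review flagged against the truly relevant ones."""
    flagged, relevant = set(flagged), set(relevant)
    true_hits = flagged & relevant
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    precision = len(true_hits) / len(flagged) if flagged else 0.0
    return recall, precision

# Hypothetical example: 10 relevant documents; the review flags 8,
# of which 6 are actually relevant.
relevant_ids = range(1, 11)
flagged_ids = [1, 2, 3, 4, 5, 6, 11, 12]

recall, precision = recall_and_precision(flagged_ids, relevant_ids)
print(recall)     # 0.6, i.e. 60% of the relevant documents were found
print(precision)  # 0.75, i.e. 75% of the flagged documents were relevant
```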

Predictive coding reduces the number of attorneys needed and the amount of time these attorneys spend reviewing documents, allowing technology to capitalize on human input and training to significantly cut review costs for the appropriate types and sizes of matters. The amount of savings is dependent on the degree to which the user is willing to rely on the automated coding decisions and the portion of the data that is likely to need human review.

Risks of Predictive Coding

Many discovery service providers and software companies offer their own variations on predictive coding, causing the specific steps involved in using the technology to vary.

One of the critical issues in predictive coding is how the technology identifies the sample set that attorneys will use to train it for use with the remaining majority of the data population. There are different takes on seed set creation; today, there is no standardization. Experts use terms like random sampling or stratified sampling to describe whether sample documents are taken randomly from the entire data population or whether groups of documents are culled across all custodians to ensure everyone is represented in the seed set. It is also worth noting that review of the seed set must be conducted by senior lawyers familiar with the matter, which shifts the highest-cost resources to the beginning of the eDiscovery process. This requirement necessitates a thorough cost-benefit analysis before predictive coding is employed.
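The difference between the two sampling approaches can be illustrated with a small sketch. It assumes each document is a hypothetical `(doc_id, custodian)` pair and that "stratified" means drawing a fixed number of documents per custodian; vendors may define and implement seed set selection differently.

```python
import random
from collections import defaultdict

def random_seed_set(docs, size, rng):
    """Draw documents at random from the entire population."""
    return rng.sample(docs, size)

def stratified_seed_set(docs, per_custodian, rng):
    """Draw up to per_custodian documents from every custodian."""
    by_custodian = defaultdict(list)
    for doc in docs:
        by_custodian[doc[1]].append(doc)
    seed = []
    for custodian in sorted(by_custodian):
        group = by_custodian[custodian]
        seed.extend(rng.sample(group, min(per_custodian, len(group))))
    return seed

# Hypothetical population: one custodian holds 95% of the documents.
docs = [(i, "alice") for i in range(95)] + [(i, "bob") for i in range(95, 100)]
rng = random.Random(0)  # fixed seed so the draw is repeatable

simple = random_seed_set(docs, 10, rng)
stratified = stratified_seed_set(docs, 5, rng)

# A purely random draw may contain few or no documents from "bob";
# the stratified draw always includes every custodian.
print(sorted({custodian for _, custodian in simple}))
print(sorted({custodian for _, custodian in stratified}))
```

Random sampling mirrors the population's proportions, while stratified sampling trades that for guaranteed coverage of small custodians. Which is preferable depends on the matter.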

Certain types of documents or evidence do not work as well with this type of learning-based system. Audio and video evidence cannot currently be handled, and OCR'd documents present challenges. Compound documents such as the minutes of a board of directors meeting can be confusing to technology because of the multitude of themes that may be present within a single document. Financial documents such as Excel spreadsheets or invoices, which contain few terms, may hold key evidence but are not easy to relate to the rest of the document universe.

"Grey areas" involve matters of degree, such as how far privilege extends across documents, or where the line falls between what is permissible or legal and what is potentially violative or gives rise to a cause of action. Currently, almost every predictive coding tool limits the attorney training the system via the seed set to marking documents as "responsive" or "non-responsive," thus wiping out the careful reasoning behind decisions relating to matters of degree or what might cross the line of permissible/impermissible activity that is central to a legal matter.

Workflows (or How Does It Actually Work)

The possibility of faster results, potentially more accurate results, and most especially cost savings, is appealing. Without any type of standardization across predictive coding platforms, there is understandable hesitation to adopt unproven technologies that the majority of courts have not yet endorsed. As more cases come through the court system, litigants will gain insight and instruction into what the courts believe is appropriate and what fails to pass muster in terms of specific workflow combinations.

For now, the key workflow issues that must be considered when using predictive coding include:

How many documents are going to be included in the seed set? How is it culled from the data population? Who is going to review the seed set (senior partners, the lead attorney on the case, senior associates)?

How many training iterations will be required to avoid missing key information? What level of precision and recall is appropriate?

What percentage of the responsive documents identified by predictive coding will be further reviewed by humans? How will predictive coding identify and manage potentially privileged documents? What will courts consider to be reasonable steps to ensure no waiver of privilege through accidental production?

How do firms address the issue that the most senior attorneys, who have the context and understanding of strategy, don't necessarily have the time to sit and review document seed sets? Can the legal team wait patiently for the bulk of documents to be amassed and processing completed?


Predictive coding offers the promise of faster, more consistent results at lower cost than traditional linear review of an entire document population by groups of attorneys. There are significant benefits and risks to predictive coding, and only a few courts have had the opportunity to evaluate and balance this equation. It will continue to be important to weigh the cost, risk, and defensibility of the approach, as well as the importance of the matter, before selecting any technology solution.
