Machines can analyze vast volumes of data with a speed and accuracy beyond human cognitive capacity. AI amplifies our capabilities and frees us up for work that requires uniquely human skills, such as professional judgment, creativity, and customer empathy. Simply put, machines and experts do more together than either can do alone.

Building a Legal Data Strategy

AI software is only as good as the data it analyzes. Where does all that data come from? How do we know it’s the data we need?

To answer these questions and many others, we need a data strategy: a comprehensive, top-down approach that organizes and manages an organization’s data to effect impactful change. We want data on our cases and customers, data on personnel and communication, data about resources, and data about risk management and uncertainty. Any record of fact that we have, we can use to elevate our business.

We often view the different aspects of an operation as discrete and separate entities. Without a data strategy, we may miss the subtle patterns that run across a business’s operations, patterns that analysis can surface only if the organization’s data is well organized. A legal data strategy focuses on developing data collection and capabilities to improve the quality and efficiency of an organization’s operations.

A good data strategy takes an inventory of an organization’s goals and resources, both human and technological. Neither people nor tech tools can function well in an organization where data management is weak.

Types of Data

There are two types of data: structured and unstructured. Examples of structured legal data include billing records, docket records, and litigation outcomes. This data already has some organization, such as fixed fields and consistent formats. Unstructured data, on the other hand, typically doesn’t have any kind of internal structure. Examples of unstructured data include court filings, contracts, deposition transcripts, and other documents that are read mainly for their content and are rarely treated as separate data points in a larger set. To bring structure to unstructured data like contract data, we can use AI.
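To make the distinction concrete, here is a minimal Python sketch of turning one sentence of contract text into a structured record. The `ContractRecord` schema, the two regular expressions, and the sample sentence are all hypothetical illustrations; real contracts vary far more than these patterns allow, and production tools rely on trained models rather than two hand-written rules.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContractRecord:
    """A structured row derived from free-form contract text (hypothetical schema)."""
    party_a: Optional[str]
    party_b: Optional[str]
    term_months: Optional[int]


# Hypothetical patterns for illustration only.
PARTIES = re.compile(r"between (?P<a>[A-Z][\w&. ]+?) and (?P<b>[A-Z][\w&. ]+?)[,.]")
TERM = re.compile(r"term of (?P<n>\d+) months")


def to_record(text: str) -> ContractRecord:
    """Pull the parties and the term out of unstructured text, if present."""
    parties = PARTIES.search(text)
    term = TERM.search(text)
    return ContractRecord(
        party_a=parties.group("a") if parties else None,
        party_b=parties.group("b") if parties else None,
        term_months=int(term.group("n")) if term else None,
    )


sample = (
    "This Agreement is made between Acme Corp and Beta LLC, "
    "for a term of 24 months."
)
record = to_record(sample)
print(record)
```

Once text has been reduced to records like this, it can be counted, filtered, and compared in the same way as billing or docket data.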

Managing Legal Data

It’s easy to talk about data in the abstract because data is somewhat abstract. To convert unstructured data into useful, structured data, organizations should attend to the following four elements:

      • Recognition: Is useful or relevant data recognized as such? What are we potentially ignoring?
      • Storage: Is relevant data being effectively organized and stored?
      • Accessibility: Who has access to the data? How is the data used? Does everyone know how to find the data they need?
      • Publication: Are there directories of all data? How are these directories structured and organized?

A good data strategy moves through these four stages in a cycle. Each aspect leads directly into the others, allowing for constant improvement and iteration.


The recognition stage of a legal data strategy is the first integration point between people and technology. Recognizing relevant, valuable data in your organization is not always easy, but many tech tools make it easier. For email, Google, Outlook, and other services keep track of metadata and content. For contracts and other documents, solutions that scan documents can automatically recognize many types of data.

Using software for data recognition is important, but legal, technology, and business professionals must also work together to address what kinds of data the organization has, what types of data are needed, and whether current capture methods are getting the job done.


The second stage of the cycle is data storage. We gather data to make predictions about the world. The more data we collect and the longer we collect it, the more confident we can be that our predictions will be accurate; this is why data storage is so important. Organizing data in a meaningful way can be accomplished with technology tools, but having a data strategy creates the conditions under which the people in an organization actively store and manage their structured and unstructured data.
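As a rough illustration of storing structured data so that it can be queried later, the sketch below uses Python's built-in `sqlite3` module. The `matters` table, its columns, and the sample rows are hypothetical; a real organization would choose its own schema and storage system.

```python
import sqlite3

# In-memory database for illustration; a real store would be persistent.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE matters (
        id INTEGER PRIMARY KEY,
        client TEXT NOT NULL,
        practice_area TEXT NOT NULL,
        outcome TEXT,
        settlement_usd REAL
    )
    """
)

# Hypothetical sample rows.
rows = [
    ("Acme Corp", "litigation", "settled", 250000.0),
    ("Beta LLC", "litigation", "dismissed", None),
    ("Acme Corp", "employment", "settled", 40000.0),
]
conn.executemany(
    "INSERT INTO matters (client, practice_area, outcome, settlement_usd) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# Because the data is stored consistently, simple questions become one query.
average_settlement = conn.execute(
    "SELECT AVG(settlement_usd) FROM matters WHERE outcome = 'settled'"
).fetchone()[0]
print(average_settlement)
```

The longer records like these accumulate, the more history there is to query, which is the practical reason storage matters for prediction.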


Once an organization has begun recognizing and storing data, the next stage of the cycle is to make the data appropriately accessible. Unstructured data needs to be processed to become structured data, and the application of standard forms or tags reduces redundancy and improves data integrity. Practical accessibility means having a clear system for retrieving, analyzing, extracting, transforming, and otherwise managing data.
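The point about standard tags can be sketched in a few lines of Python. The tag vocabulary and the sample labels below are hypothetical; the idea is simply that several free-form labels for the same kind of document collapse into one canonical tag.

```python
# Hypothetical mapping from free-form labels to a standard tag vocabulary.
CANONICAL_TAGS = {
    "nda": "non-disclosure-agreement",
    "non disclosure agreement": "non-disclosure-agreement",
    "confidentiality agreement": "non-disclosure-agreement",
    "msa": "master-services-agreement",
    "master services agreement": "master-services-agreement",
}


def normalize_tag(raw: str) -> str:
    """Map a free-form label onto a standard tag; unknown labels are flagged."""
    # Lowercase, treat hyphens as spaces, and collapse repeated whitespace.
    key = " ".join(raw.lower().replace("-", " ").split())
    return CANONICAL_TAGS.get(key, "unclassified")


labels = ["NDA", "Non-Disclosure Agreement", "msa ", "Master  Services Agreement", "Lease"]
tags = [normalize_tag(label) for label in labels]
print(tags)
```

Five differently typed labels reduce to three tags, and anything outside the vocabulary is surfaced as "unclassified" for review instead of silently creating a new category.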


The final stage of the cycle is publication, where data is actually used to produce tangible results. Publication doesn’t necessarily mean that an organization is releasing company data for the whole world to see (although this often does happen in SEC filings, earnings reports, press releases, etc.). Instead, publication refers to maintaining a directory of data within an organization. A software tool can aid in this process. It is important to pay attention to the default method of organizing data and to the best way for your organization to communicate important data internally; some document management systems, for example, build helpful diagrams and charts to communicate an understandable message about a particular subject.

Publication focuses on communication. What is our data telling us? What conclusions can we draw from an analysis of all this data? Many actionable insights come from this stage.
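An internal directory of data can be as simple as a searchable list of what exists, who owns it, and where it lives. The Python sketch below is one hypothetical shape for such a directory; the class names, fields, and `dms://` locations are invented for illustration and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One line in the directory: what the data is, who owns it, where it lives."""
    name: str
    owner: str
    location: str
    tags: set = field(default_factory=set)


class DataDirectory:
    """A minimal in-memory directory of an organization's datasets."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, tag: str) -> list:
        """Return the names of all datasets carrying the given tag, sorted."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)


directory = DataDirectory()
# Hypothetical entries.
directory.register(DatasetEntry("billing_2023", "finance", "dms://billing/2023", {"billing", "structured"}))
directory.register(DatasetEntry("nda_archive", "legal-ops", "dms://contracts/nda", {"contracts", "unstructured"}))
directory.register(DatasetEntry("docket_history", "litigation", "dms://dockets", {"litigation", "structured"}))

print(directory.find("structured"))
```

A directory like this answers the publication questions directly: what data exists, how it is organized, and who to ask about it.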

Data Science in Law

Data science uses analytical techniques to better understand, diagnose, forecast, and predict business outcomes. Virtually every industry now collects data about what it does and how it does it. This ubiquity of data collection means that the world is becoming increasingly quantified. The legal industry, while lagging behind others, is not immune to this trend. There are a variety of reasons why legal lags behind:

      • Data collection has been limited thus far, especially for law firms;
      • Few lawyers have quantitative/science backgrounds, so they are less aware of the value of data and haven’t conditioned their ‘data muscles’;
      • The legal industry is fragmented, and there is a wide array of work undertaken at a typical large firm or corporate law department; and
      • Up to this point, most systems that collect data have offered less-than-ideal user experiences, affecting adoption.

Law departments and law firms that do not collect and analyze their data stores will be left behind when trying to mitigate risk, better predict and achieve outcomes, improve the customer experience, and reduce costs. Some forward-thinking corporate law departments and law firms have created formal roles for legal data scientists. We work with their legal and risk teams to design and implement data strategies and new techniques to harness value from their data.

Corporate law departments got a head start on law firms because in-house teams have amassed more data in corporate systems, and law is increasingly integrated into the business. Law firms have been catching up. Most are focused on modernizing reporting or have clients driving their activity. Opportunities for strategic, systematic investment abound, such as systematically tracking outcome and settlement data for all of a firm’s litigation work. Law firm leadership is starting to give data science and predictive analytics capabilities higher priority.

[Figure: Data Strategy Maturity Model, showing the data maturity process stage by stage]

Using AI to Bring Legal Data and Data Science to Life

Natural Language Processing (NLP) in a Nutshell

Researchers and developers have been working on natural language processing (NLP) and machine learning packages for decades. NLP tools are in all sorts of products, from search engines to customer service chatbots. Odds are, you’ll interact with an NLP technology at some point today, like Apple’s Siri or Amazon Alexa, that is programmed to understand simple queries. As NLP tech tools learn how to answer simple questions correctly, the data used to train them makes it possible to handle more complex queries.

NLP’s iterative acquisition of English proficiency is not unlike how people acquire language as children. First, we learn common, easy-to-use words and phrases, and then more vocabulary, syntax, and grammar develop through trial and error over a child’s first decade or so. NLP tools built on machine learning improve in a similar way, with developers feeding new data into the system and gauging its progress through trial and error.

A simple Google search is another familiar example of NLP usage. When you type a question into Google’s search bar, that question gets parsed by NLP and transformed from a sentence a human understands into a machine-readable query that the search engine can interpret and answer.
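The step from a human sentence to a machine-readable query can be illustrated with a toy Python sketch. This is not how Google or any real search engine works; the stopword list and the `parse_query` function are hypothetical, and the sketch only shows the general idea of reducing a question to its searchable terms.

```python
import re

# A small, hypothetical stopword list for illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "of", "for", "in", "to", "on"}


def parse_query(question: str) -> list:
    """Lowercase the question, split it into word tokens, and drop filler words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [token for token in tokens if token not in STOPWORDS]


terms = parse_query("What is the statute of limitations for fraud in Delaware?")
print(terms)  # ['statute', 'limitations', 'fraud', 'delaware']
```

Real systems go much further, for example by recognizing that "statute of limitations" is a single concept, but the output above is already something an index can look up term by term.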

The NLTK Toolkit for NLP

You may not have heard of the Natural Language Toolkit (NLTK), but you may well have used software built with it. NLTK is a general-purpose, Python-based NLP library that works well on everyday, general-knowledge text. First built at the turn of the 21st century, NLTK is used to find, extract, and parse text according to carefully designed rules, and it is taught at universities around the world.

NLP software like NLTK works wonders for general text analysis but leaves much to be desired when it comes to more specific areas of expertise. Several NLP tools have already been tailor-made for the medical field, for example. These toolkits take NLP libraries like NLTK and train them further to extract and analyze language specific to their field. In the medical field, these NLP tools would cover terms describing diseases, the names of various microbes, and obscure pharmaceutical words that most people seldom or never use. In legal, NLP tools like Elevate’s LexNLP are built around the language specific to law.

LexNLP and Law

Researchers and developers have produced NLP tools for medicine and other fields, but there are few comparable tools for the legal field. That gap is striking: the legal field is integral to society, employs millions of people worldwide, and, just like medicine, has its own distinct, intricate lexicon.

Elevate has filled this gap with LexNLP. LexNLP is designed to give legal professionals tools and data to work with real legal and regulatory text, including statutes, regulations, court opinions, briefs, contracts, and other legal work product. LexNLP is a Python-based toolkit just like NLTK and operates in a similar manner. In terms of the ability to find, extract, and analyze text, think of NLTK as an undergraduate with a bachelor’s degree in English. LexNLP, by contrast, is a law school graduate with several years of experience at a firm. Just like an undergraduate who goes to law school, LexNLP takes NLTK’s natural language library and adds additional terms, grammatical structures, and other elements specific to legal and regulatory documents.

LexNLP is an open-source Python package and is one of the tools that powers Elevate’s ELM Analyse Documents software. The training of LexNLP includes several different databases of legal material, including the SEC’s EDGAR database.

For an example of how LexNLP specializes in the particulars of legal texts, one can look at how legal documents handle numbers. Frequently, numbers in legal documents are words rather than digits. LexNLP accurately interprets these elements of a document, making sure that “onethousand fifty seven,” “one thousand fifty seven,” “one thousand and fifty seven,” and “one thousand fiftyseven” are all read as “1057”.
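To show why this normalization takes some care, here is a small standalone Python sketch that reads all four spellings above as the same integer. It is not LexNLP's implementation; the `words_to_int` function and its word tables are hypothetical, and it assumes the input is a number phrase, since it would also pick up number words embedded in unrelated text.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}

# Longest words first, so "seventeen" is not read as "seven" plus leftovers.
# Matching without word boundaries is what lets "onethousand" split correctly.
_WORDS = sorted(list(UNITS) + list(TENS) + list(SCALES) + ["hundred"], key=len, reverse=True)
_WORD_RE = re.compile("|".join(_WORDS))


def words_to_int(phrase: str) -> int:
    """Convert an English number phrase (possibly with fused words) to an int."""
    total = 0    # completed thousands/millions groups
    current = 0  # the group being built, e.g. "fifty seven"
    for word in _WORD_RE.findall(phrase.lower().replace("-", " ")):
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        else:  # thousand or million closes the current group
            total += max(current, 1) * SCALES[word]
            current = 0
    return total + current


variants = [
    "onethousand fifty seven",
    "one thousand fifty seven",
    "one thousand and fifty seven",
    "one thousand fiftyseven",
]
values = [words_to_int(v) for v in variants]
print(values)  # [1057, 1057, 1057, 1057]
```

Note that the word "and" is simply skipped because it contains no number word, which is why the third variant needs no special handling.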

LexNLP uses several different methods to process legal and regulatory text into a machine-readable format:

Stopwords:  Stopwords are short words like “the” or “of” that appear in nearly every document written in English. However, legal and regulatory texts often use more specialized stopwords, some of which are from other languages (most notably Latin). LexNLP can detect and parse these rarer, niche stopwords.
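The effect of a legal stopword list can be sketched in plain Python. Both word lists below are short, hypothetical samples chosen for illustration; they are not LexNLP's actual stopword data.

```python
import re

# Hypothetical, deliberately tiny word lists.
GENERAL_STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "be", "by", "or", "for"}
LEGAL_STOPWORDS = {
    "herein", "hereof", "hereto", "thereof", "whereas",
    "inter", "alia", "mutatis", "mutandis",
}


def content_words(text: str, stopwords: set) -> list:
    """Tokenize on letters only and keep the words not in the stopword set."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]


clause = "Whereas the parties hereto agree, inter alia, to the terms set forth herein."
general_only = content_words(clause, GENERAL_STOPWORDS)
with_legal = content_words(clause, GENERAL_STOPWORDS | LEGAL_STOPWORDS)

print(general_only)
print(with_legal)  # ['parties', 'agree', 'terms', 'set', 'forth']
```

With only the general list, filler such as "whereas" and "inter alia" survives and adds noise to any downstream analysis; adding the legal list leaves just the words that carry the clause's meaning.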

Collocations:  Like stopwords, we use collocations every day. Any word that frequently pairs with another word or words is called a collocation. For example, there’s one in the previous sentence (“pair(s) with”). Like stopwords, legal and regulatory texts often use collocations unique to the legal field not covered by more generalized NLP toolkits. LexNLP, however, is trained to spot these unique examples.
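A simple way to surface candidate collocations is to count adjacent word pairs. The Python sketch below does only that; it is a simplified stand-in, since real collocation finders usually score pairs with statistical association measures instead of raw counts, and the sample text is invented.

```python
import re
from collections import Counter


def top_bigrams(text: str, n: int = 3) -> list:
    """Return the n most frequent adjacent word pairs with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(tokens, tokens[1:])).most_common(n)


text = (
    "The indemnifying party shall hold harmless the other party. "
    "Each party agrees to hold harmless its affiliates. "
    "Nothing herein limits the duty to hold harmless."
)
best = top_bigrams(text, n=1)
print(best)  # [(('hold', 'harmless'), 3)]
```

Even on three sentences, the legal phrase "hold harmless" stands out as the pair that keeps recurring, which is the kind of signal a legal NLP toolkit is trained to recognize at scale.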

Segmentation:  LexNLP training includes finding and distinguishing titles, headings, sub-headings, sections, and paragraphs from context.
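Segmentation can be approximated with a rule-based sketch when headings follow a predictable pattern. The Python below assumes numbered headings such as "1. Definitions" or "Section 2. Term" at the start of a line; that assumption, the regular expression, and the sample document are hypothetical, and this is not how LexNLP itself segments text.

```python
import re

# Assumed heading shape: an optional "Section", a number, a period, then a title.
HEADING = re.compile(r"^(?:Section\s+)?(\d+)\.\s+(.+)$", re.MULTILINE)


def split_sections(document: str) -> list:
    """Split a document into sections, each with a number, title, and body."""
    matches = list(HEADING.finditer(document))
    sections = []
    for i, match in enumerate(matches):
        # A section's body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        sections.append({
            "number": int(match.group(1)),
            "title": match.group(2).strip(),
            "body": document[match.end():end].strip(),
        })
    return sections


doc = """1. Definitions
"Services" means the work described in Exhibit A.

2. Term
This Agreement runs for two years.
"""
sections = split_sections(doc)
for section in sections:
    print(section["number"], section["title"], "->", section["body"])
```

A fixed rule like this breaks as soon as headings are unnumbered or a body line happens to start with a number, which is why trained segmentation that uses context is valuable.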

Tokens, Stems, and Lemmas: LexNLP uses the Treebank tokenizer, Snowball stemmer, and WordNet lemmatizer. LexNLP can also be customized to function with the Stanford NLP toolkit.

Parts of Speech:  LexNLP can find verbs, adjectives, adverbs, and nouns.

LexNLP can also find and extract key text and structural information elements, including addresses, amounts, citations, conditional statements (e.g. “at least,” “within”), dates, definitions, distances, durations, money/currency, percentages, ratios, regulations, trademarks, and URLs. LexNLP can also extract named entities, such as companies, countries, NGOs, and other geopolitical entities. These text features can then be transformed into data for training supervised and unsupervised models.
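The step from extracted elements to model-ready data can be sketched as follows. This Python example uses three hand-written regular expressions for money, percentages, and durations; they are hypothetical simplifications and do not reproduce LexNLP's extractors, which handle far more formats.

```python
import re

# Simplified, hypothetical patterns for illustration.
MONEY = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s?%")
DURATION = re.compile(r"(\d+)\s+(day|month|year)s?\b")


def extract_features(text: str) -> dict:
    """Turn a clause into a dictionary of typed values a model could consume."""
    return {
        "money": [float(m.replace(",", "")) for m in MONEY.findall(text)],
        "percents": [float(p) for p in PERCENT.findall(text)],
        "durations": [(int(n), unit) for n, unit in DURATION.findall(text)],
    }


clause = "Buyer shall pay $1,250,000.00 within 30 days, plus interest at 5.5% per year."
features = extract_features(clause)
print(features)
```

Each clause becomes a small record of numbers and units, and a collection of such records is the kind of table that supervised or unsupervised models can be trained on.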

Going Forward: Using AI in Law

Lawyers and other legal professionals need tools like LexNLP to turn real, unstructured legal documents into structured data objects. LexNLP can help a legal organization manage the chaos of its data and turn its documents into helpful, machine-readable texts. The full paper “LexNLP: Natural Language Processing and Information Extraction for Legal and Regulatory Texts” is published on SSRN.

New possibilities arise when legal professionals harness the power of AI to process large volumes of unstructured, complex information faster and more thoroughly. Elevate has embraced integrating AI technology into every area of our software and services solutions, using data science expertise and proprietary machine learning capabilities. Unlike other providers that apply AI as a workflow outside their core solution, we’re making data transparency between law departments, law firms, and law companies easier. By arming our customers with ‘just-in-time’ insights using artificial intelligence, we’re making informed, insightful decision-making possible for our customers.
