In this project, we developed a comprehensive parsing algorithm to extract and process legal data from an Excel file using the GPT API. The primary goal was to automate the identification and categorization of individuals involved in legal cases, including their positions, law firms, cities, and states. By leveraging the power of multiple GPT-4 models, we aimed to ensure the accuracy and reliability of the parsed data through a multi-pass approach and conflict resolution mechanisms.
The parsing process is divided into three primary stages: parsing raw data through GPT, processing and comparing parsed data, and final processing on the compiled data. Each stage involves specific tasks and methodologies to achieve the desired outcomes. This document provides a detailed overview of the entire process, including the use of various GPT models, the handling of inconsistencies, and the final aggregation and extraction of unique names for further analysis.
By the end of the project, we successfully parsed a substantial dataset of legal case information, achieving a high degree of accuracy in identifying and categorizing individuals. This project showcases the capabilities of GPT models in data processing and highlights the importance of multi-pass approaches and conflict resolution in ensuring data integrity. The following sections detail each stage of the parsing process, from initial data input to the final extraction of unique names, providing insights into the methodologies and tools used throughout the project.
There are three primary stages to the parsing process:

1. Parsing the raw data through the GPT API.
2. Processing and comparing the parsed data from the two passes.
3. Final processing of the compiled data.
The first step of the algorithm is to read a data row from the Excel file and pass it to the GPT API for parsing. An example row (row 10) from the data is shown below:
Case Name | Docket No. | Parties (full case name) | Filing Type | Counsel | Filed date |
---|---|---|---|---|---|
FORD MOTOR CO. v. BANDEMER | No. 19-369 | FORD MOTOR COMPANY, Petitioner, v. ADAM BANDEMER, Respondent. | Initial Brief: Intervenor-Appellant | NEAL KUMAR KATYAL, SEAN MAROTTA, Counsel of Record, KIRTI DATLA, ERIN R. CHAPMAN, HOGAN LOVELLS US LLP, Washington, DC, Counsel for Petitioner. | 18-Sep-19 |
We read and extract the counsel string from the row so that it can be passed to the GPT API; a minimal sketch of this step is shown below.
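The sketch assumes pandas and the column layout shown above; the file name and exact indexing are illustrative, not the project's actual code:

```
import pandas as pd

# Load the spreadsheet of case filings (file name is an assumption).
df = pd.read_excel("legal_cases.xlsx")

# Pull the counsel string for a given row, e.g. row 10 from the example.
counsel = df.loc[10, "Counsel"]
```

We prompt the GPT API with the following system instructions: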
SYSTEM_INSTRUCTIONS = ```
You are a data processing assistant. In the given data, identify all individuals, their position, law firm, city, and state. Return the data in JSON format. Every lawyer should be associated to a law firm. Fill any cell where data is not available as N/A. The JSON should contain a list keyed with `individuals` listing all of the individuals in the data. The keys for each individual should be: `name`, `position`, `law_firm`, `city`, `state`
```
The above string is sent to the GPT API as prompt instructions. Along with it, we send the string of unparsed counsels. For the example above, the string sent to the GPT API would be:
NEAL KUMAR KATYAL, SEAN MAROTTA, Counsel of Record, KIRTI DATLA, ERIN R. CHAPMAN, HOGAN LOVELLS US LLP, Washington, DC, Counsel for Petitioner.
The string receives no additional processing before GPT; it is sent to the API in its raw form. The GPT API uses a messages list to keep track of the chat history: the initial prompt is sent as a `system` instruction and the unparsed string as a `user` message, as sketched below.
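A minimal sketch of the call, assuming the OpenAI Python client (`SYSTEM_INSTRUCTIONS` is the prompt shown above; the client version used in the project may have differed):

```
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def parse_counsel(counsel: str, model: str = "gpt-4-1106-preview") -> str:
    """Send the system instructions plus the raw counsel string and
    return the model's (JSON-formatted) reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTIONS},
            {"role": "user", "content": counsel},  # raw, unprocessed string
        ],
    )
    return response.choices[0].message.content
```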
A variety of GPT-4 models were used for parsing. Because the parsing was performed over the span of a few weeks, during which new GPT models were released, we decided it would be worthwhile to run the data through different models and compare their outputs. As such, different sections of the data were run through different GPT-4 models. The models we used were: `gpt-4-1106-preview`, `gpt-4-turbo-2024-04-09`, and `gpt-4o-2024-05-13`.
The following table describes how the various models were used in parsing the data. The groups show the row ranges in which the data were parsed.
Data Section | First Pass | Second Pass | Third Pass (conflict resolution) |
---|---|---|---|
0k - 5k | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4o-2024-05-13 |
5k - 15k | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4o-2024-05-13 |
15k - 30k | gpt-4-1106-preview | gpt-4-turbo-2024-04-09 | gpt-4o-2024-05-13 |
30k - 45k | gpt-4-1106-preview | gpt-4-turbo-2024-04-09 | gpt-4o-2024-05-13 |
Note: we chose not to re-run the second pass on the 0k-15k section of the data, judging that the new model (`gpt-4-turbo-2024-04-09`), while producing different output, did not offer enough of an improvement to justify the cost of re-parsing. It did, however, differ enough from `gpt-4-1106-preview`, and improve enough over simply running two passes with that model, that we switched to it for the second pass of the remaining sections.
All conflict resolution was run with the `gpt-4o-2024-05-13` model, as it provided a third, newly released model distinct from those used in both passes.
We instructed GPT to return the data in JSON format, so the result is well structured; however, GPT sometimes prepends extraneous text before the opening curly brace of the JSON object. We therefore parse the returned string and keep only the data between the first opening curly brace and the last closing curly brace.
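A minimal sketch of this cleanup step (illustrative, not the project's exact code):

```
import json

def extract_json(raw: str) -> dict:
    """Keep only the text between the first opening brace and the last
    closing brace, discarding any extraneous text around the JSON."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in GPT response")
    return json.loads(raw[start : end + 1])
```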
The resulting JSON contains an `individuals` list containing nested JSONs with the key-value pairs corresponding to the requested data: name, position, law_firm, city, and state. The resulting JSON object returned by GPT from the above example is shown below:
{
    "individuals": [
        {
            "name": "Neal Kumar Katyal",
            "position": "Counsel of Record",
            "law_firm": "Hogan Lovells US LLP",
            "city": "Washington",
            "state": "DC"
        },
        {
            "name": "Sean Marotta",
            "position": "Counsel of Record",
            "law_firm": "Hogan Lovells US LLP",
            "city": "Washington",
            "state": "DC"
        },
        {
            "name": "Kirti Datla",
            "position": "N/A",
            "law_firm": "Hogan Lovells US LLP",
            "city": "Washington",
            "state": "DC"
        },
        {
            "name": "Erin R. Chapman",
            "position": "N/A",
            "law_firm": "Hogan Lovells US LLP",
            "city": "Washington",
            "state": "DC"
        }
    ]
}
This data is written to a result JSON file, along with the pertinent information from the original row, for further processing. A unique JSON file is created for each row of the original data and is named after its row number. Since our example came from row 10, the final JSON file is `10-info.json`, which looks like:
{
"individuals": [
... see above ...
],
"original_row": 10,
"case_name": "FORD MOTOR CO. v. BANDEMER",
"docket_number": [
"No. 19-369"
],
"filing_type": [
"Initial Brief: Intervenor-Appellant"
],
"date": "2019-09-18"
}
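A minimal sketch of the per-row write (the helper name and argument structure are illustrative):

```
import json

def write_row_result(row_number: int, parsed: dict, row_metadata: dict) -> None:
    """Merge the parsed individuals with the pertinent fields from the
    original row and write them to a per-row JSON file."""
    record = {**parsed, "original_row": row_number, **row_metadata}
    with open(f"{row_number}-info.json", "w") as f:
        json.dump(record, f, indent=2)
```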
During our testing, we found that while usually accurate, the GPT API sometimes gave inconsistent or inaccurate results. These errors occurred often enough that we decided parsing the dataset through the API twice would help catch them. As such, after step 1 we ended up with two separate, independently generated sets of parsed data. GPT's sampling is not deterministic, so the same input can, and often does, produce different results. We judged it quite unlikely that both runs would make the same mistake, and therefore settled on running the parse twice.
Several stages were used when aggregating the results from the multiple GPT runs. When comparing the files, we identified five unique scenarios that required different levels of intervention to produce a singular *aggregated* output. These are: identical profiles (all fields agree), identical names (other fields differ), highly similar names, highly different names, and name list length mismatches (the two runs returned different numbers of individuals). A sketch of how a pair of aligned profiles might be classified is shown below.
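This hedged sketch illustrates the pairwise classification; the similarity threshold and matching details are illustrative, not the project's exact values:

```
from difflib import SequenceMatcher

def classify_pair(profile_a: dict, profile_b: dict) -> str:
    """Classify a pair of parsed profiles from the two GPT runs.
    Assumes the name lists have already been aligned; length
    mismatches are handled separately."""
    if profile_a == profile_b:
        return "Identical Profile"
    if profile_a["name"] == profile_b["name"]:
        return "Identical Names"  # same name, other fields differ
    similarity = SequenceMatcher(None, profile_a["name"], profile_b["name"]).ratio()
    if similarity >= 0.8:  # assumed threshold
        return "Highly Similar Names"
    return "Names Highly Different"
```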
We processed a total of 147,247 names. The counts and percentages for each category are:
Category | Count | Percent |
---|---|---|
Identical Profile | 113027 | 76.76% |
Identical Names | 28786 | 19.55% |
Highly Similar Names | 1780 | 1.21% |
Names Highly Different | 141 | 0.10% |
Name List Length Mismatch - Matched With Tiebreaker | 3513 | 2.39% |
Name List Length Mismatch - Could Not Match | 67 (rows) | -- |
The final output of the comparison is exported to a JSON file containing all names parsed from the data.
At this stage, we have a list of 147,247 names extracted from the counsel field through the process described above. For further analysis, we want unique names, with distinct first, middle, and last names available as separate fields for querying.
Through a process similar to the initial parse, the list of unique names was once again sent through the GPT API. The system instructions were:
"Given this list of names, extract the first, middle, and last names along with any suffixes and prefixes if applicable. Format the data as json. Key the list of parsed names with people with each person having the keys: first_name, middle_name, last_name, prefix, suffix. Use 'N/A' if any key is not applicable"
This time, we determined that GPT-3.5, specifically `gpt-3.5-turbo-0125`, would be sufficient to process the data, and we chose it to save both cost and time. This data was also exported as JSON and later flattened to a CSV file for sharing; a sketch of the flattening step is shown below.
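A minimal sketch of the flattening (file names are assumptions; the field names follow the keys requested in the instructions above):

```
import csv
import json

FIELDS = ["first_name", "middle_name", "last_name", "prefix", "suffix"]

# Load the parsed names (file name is an assumption).
with open("parsed_names.json") as f:
    people = json.load(f)["people"]

# Flatten each person's record into one CSV row.
with open("parsed_names.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(people)
```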
The Parsing Algorithm Project exemplifies the power and versatility of GPT models in automating complex data extraction and processing tasks. In the project debrief, we noted that before access to GPT models, this task would have had to be performed manually, likely costing over \$10,000; the API usage for processing this data cost about \$1,000, a substantial saving. By leveraging multiple iterations of GPT-4 models and employing a structured, multi-pass approach, we achieved a high degree of accuracy in parsing and categorizing legal case information from a large dataset. The methodology involved careful planning, model selection, and the implementation of conflict resolution strategies to ensure the reliability of the parsed data.
Throughout the project, we encountered and addressed various challenges, such as inconsistent outputs from the GPT API and the need for efficient data aggregation techniques. Our approach to running the data through multiple models and comparing results proved effective in mitigating errors and ensuring data integrity.
The final stage of the project involved extracting unique names and breaking them down into distinct components, such as first, middle, and last names, along with any prefixes or suffixes. This detailed level of data processing enables more granular analysis and facilitates further research and application.
In summary, this project not only highlights the capabilities of advanced language models like GPT-4 but also underscores the importance of thoughtful algorithm design and rigorous testing. By integrating multiple tools and methodologies, we successfully created a robust parsing algorithm that can serve as a foundation for future data processing projects. The insights gained and the methodologies developed during this project will undoubtedly contribute to ongoing advancements in automated data extraction and analysis.