The Cambridge Law Corpus: A corpus for legal AI research

We introduce the Cambridge Law Corpus (CLC), a collection of over 250,000 UK court cases intended as a resource for legal AI research. The cases span from the 16th century to the present day. In this paper, we present the initial release of the corpus, which includes the raw text and accompanying metadata. In addition, legal professionals annotated case outcomes for 638 cases, which we use to train and evaluate models such as GPT-3, GPT-4, and RoBERTa. We also examine the legal and ethical implications of this sensitive material; as a result, the corpus is released in limited form, solely for research purposes.

https://arxiv.org/abs/2309.12269