I’ve had to do a bunch of legal reading lately, and ChatGPT has been helping a little, so I thought it would be a nice idea to use Retrieval Augmented Generation (RAG) to supplement GPT-4’s understanding of New Zealand law, in the hopes it could pass that understanding on to me.

Langchain and LlamaIndex are the two most popular packages for getting started with RAG, and I tried both. Honestly, they both have their issues, but I found myself moving a bit faster with LlamaIndex, so I went with that.

If I were building a serious product, I’d likely prototype using one of these and then shift to my own tools pretty quickly. The layers of complexity they add to ensure flexibility over relatively simple ideas aren’t quite worth it.

Python with Jupyter Notebooks in VS Code is pretty awesome these days for working iteratively on a problem, so I dived in.

I’ll provide examples using the Fair Trading Act, which is relevant for businesses and consumers in New Zealand.

Getting moving, getting data

For RAG, you need data, so I had to grab what I needed from the New Zealand Legislation website. Web scraping in Python is pretty easy with requests + BeautifulSoup, so I was able to build myself a little library to parse the web pages into JSON.
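As a rough sketch of what that scraping step looks like (the CSS selectors here are hypothetical; the real markup on the legislation site differs, and the nz_legislation library handles the actual variations):

```python
# Illustrative scraping sketch with requests + BeautifulSoup.
# The "div.prov" / "span.label" selectors are made up for this example;
# the real legislation.govt.nz markup needs more cases handled.
import requests
from bs4 import BeautifulSoup


def fetch_act(url: str) -> BeautifulSoup:
    """Download an Act's consolidated HTML page and parse it."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def parse_provisions(soup: BeautifulSoup) -> list[dict]:
    """Pull each provision out of the page as a plain dict."""
    provisions = []
    for div in soup.select("div.prov"):  # hypothetical selector
        number = div.select_one("span.label")
        title = div.select_one("h5")
        provisions.append({
            "number": number.get_text(strip=True) if number else "",
            "title": title.get_text(strip=True) if title else "",
            "sub_provs": [p.get_text(" ", strip=True) for p in div.select("p")],
            "history": [],
        })
    return provisions
```

The real parser needs a handful of extra cases for repealed sections, history notes, and so on, but the shape is the same.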

Typing in Python is also pretty good these days, so each Act got parsed into the following TypedDicts:

from typing import TypedDict, List

class ActProvision(TypedDict):
    number: str
    title: str
    sub_provs: List[str]
    history: List[str]

class Act(TypedDict):
    title: str
    front: str
    url: str
    provs: List[ActProvision]

This can then be dumped out as JSON easily enough:

{
    "title": "Fair Trading Act 1986",
    "front": "Title [Repealed]Title: repealed, on 18 December 2013, by section 4 of the Fair Trading Amendment Act 2013 (2013 No 143).",
    "url": "https://legislation.govt.nz/act/public/1986/0121/latest/whole.html",
    "provs": [
        {
            "number": "1",
            "title": "Short Title and commencement",
            "sub_provs": [
                "(1) This Act may be cited as the Fair Trading Act 1986.",
                "(2) Except as provided in section 49(3), this Act shall come into force on 1 March 1987."
            ],
            "history": []
        },
        {
            "number": "1A",
            "title": "Purpose",
            "sub_provs": [
                "..."
            ]
        }
    ]
}

The structure of the Acts varies a bit, so I ended up with a few cases to look for and parse through.

That’s why I published the little library nz_legislation, so you maybe don’t have to do it again yourself. Just don’t abuse the website.

Pulling it into LlamaIndex

So, having the data in a nice, pretty format is good; time to throw it into LlamaIndex.

Except it’s not that easy. Parsing your content into meaningful chunks is a hard part of RAG.

If you convert it all into .txt files, throw them in a folder, and run a basic RAG over it like this:

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
)

# load the documents and create the index
documents = SimpleDirectoryReader("txt_data").load_data()
index = VectorStoreIndex.from_documents(documents)
# store it for later
index.storage_context.persist(persist_dir='./txt_storage')

query_engine = index.as_query_engine()
response = query_engine.query("When is it legal for a business to contract out of its obligations under the Fair Trading Act?")
print(response)

You get okay results, but the content is split based on sentences, so a lot of context is lost in something like legislation.

Sentence splitting works well for some types of data, but here it doesn’t work as well as you might expect.
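To see why, flatten section 1 into plain text and split it naively (a crude stand-in for what a default sentence splitter does). The later chunks carry no mention of which Act or section they belong to, so their embeddings can’t match a question that names the Fair Trading Act:

```python
import re

# A provision as it might appear in the flattened .txt files (illustrative).
provision = (
    "Short Title and commencement. "
    "(1) This Act may be cited as the Fair Trading Act 1986. "
    "(2) Except as provided in section 49(3), this Act shall come into "
    "force on 1 March 1987."
)

# Naive sentence splitting: break after terminal punctuation.
sentences = re.split(r"(?<=[.?!])\s+", provision)

# The final chunk never mentions "Fair Trading" or the section title,
# so on its own it has lost the context a retriever needs.
for sentence in sentences:
    print(sentence)
```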

How does LlamaIndex use the sentences?

If you aren’t familiar with the concepts of RAG, you may wonder which sentences are given to the LLM and how they are chosen. In short:

  • Each sentence is translated into a list of numbers that makes sense to an LLM, called an embedding.
  • All the sentences are stored in a vector database.
  • When asked a question, we turn the question into an embedding too.
  • The vector database is searched for embeddings similar to the question’s.
  • The top results are given to the LLM as context to answer the question with.

I like to call it searching based on AI vibes.
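The mechanics fit in a few lines. Here is a toy sketch where a crude bag-of-words count stands in for a real embedding model (real pipelines use a learned model, but the loop is the same shape: embed, store, compare by similarity, hand the winner to the LLM):

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# The "vector database": chunks stored alongside their embeddings.
chunks = [
    "This Act may be cited as the Fair Trading Act 1986.",
    "No person shall engage in conduct that is misleading or deceptive.",
    "This Act shall come into force on 1 March 1987.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# Embed the question, find the most similar chunk, give it to the LLM.
question = "When does this Act come into force?"
query_embedding = embed(question)
best_chunk, _ = max(store, key=lambda pair: cosine(query_embedding, pair[1]))
print(best_chunk)  # → "This Act shall come into force on 1 March 1987."
```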

Better chunking

So I decided to try chunking the documents in a better way. The legislation is very structured, and each provision focuses quite strongly on one topic.

This is one spot where I’m surprised LlamaIndex doesn’t have much better tooling and explanations. Writing your own NodeParser isn’t very hard, but the docs barely explain it at all.

I dived through their MarkdownParser example first, which deals with text documents, but ended up writing my own by extending the base NodeParser.

I won’t share the exact code (more on that later) but it was something along the lines of this:

import json
from typing import Any, Dict, List, Sequence

from llama_index.node_parser.interface import NodeParser
from llama_index.schema import BaseNode, TextNode, NodeRelationship
from llama_index.utils import get_tqdm_iterable

import tiktoken
import spacy

# Requirements:
# $ pip install llama-index tiktoken spacy 
# You will also need the spacy data:
# $ python -m spacy download en_core_web_sm 

# Example data
# ------------
# {
#   "summary": "This is a summary",
#   "sub_items": [
#       {
#           "title": "A title",
#           "contents": [
#               "Some list content",
#               "Some other list content"
#           ]
#       }
#   ]
# }

class JsonSplitter(NodeParser):
    max_split_tokens: int = 512
    exclude_metadata_fields: List[str] = []

    @classmethod
    def class_name(cls) -> str:
        return "JsonSplitter"

    def _parse_nodes(self, nodes: Sequence[BaseNode], show_progress: bool = False, **kwargs: Any) -> List[BaseNode]:
        nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")
        all_nodes: List[BaseNode] = []

        for node in nodes_with_progress:
            all_nodes.extend(self.get_nodes_from_node(node))

        return all_nodes

    def get_nodes_from_node(self, node: BaseNode) -> List[BaseNode]:
        encoder: tiktoken.Encoding = tiktoken.get_encoding("cl100k_base")
        nlp = spacy.load("en_core_web_sm")

        ret_nodes: List[BaseNode] = []

        your_data = json.loads(node.get_content())

        base_metadata = {}

        # Build a base node
        ret_nodes.append(self.build_text_node(
            text=your_data['summary'],
            parent_node=node,
            metadata={
                **base_metadata,
                "section": "root"
            }
        ))

        # Insert each sub item
        for sub_item in your_data['sub_items']:
            sub_item_metadata = {
                **base_metadata,
                "title": sub_item['title']
            }

            # Check if the entire contents will fit into one chunk
            sub_item_text = "\n".join(sub_item['contents'])
            if len(encoder.encode(sub_item_text)) < self.max_split_tokens:
                # Insert the whole sub item as a single node
                ret_nodes.append(self.build_text_node(
                    text=sub_item_text,
                    parent_node=node,
                    metadata=sub_item_metadata,
                ))

                continue

            # Otherwise insert each entry with respect to max_split_tokens
            for sub_prov in sub_item['contents']:
                token_count = len(encoder.encode(sub_prov))
                if token_count < self.max_split_tokens:
                    ret_nodes.append(self.build_text_node(
                        text=sub_prov,
                        parent_node=node,
                        metadata=sub_item_metadata,
                    ))
                else:
                    # Split into sentences and accumulate them until the
                    # token budget is reached.
                    spacy_doc = nlp(sub_prov)
                    sentences = [sent.text.strip() for sent in spacy_doc.sents]
                    chunk = ""
                    for sent in sentences:
                        if chunk and len(encoder.encode(chunk + " " + sent)) > self.max_split_tokens:
                            ret_nodes.append(self.build_text_node(
                                text=chunk,
                                parent_node=node,
                                metadata=sub_item_metadata,
                            ))
                            chunk = sent
                        else:
                            chunk = (chunk + " " + sent).strip()

                    if chunk != "":
                        ret_nodes.append(self.build_text_node(
                            text=chunk,
                            parent_node=node,
                            metadata=sub_item_metadata,
                        ))
                    
        return ret_nodes
    
    def build_text_node(self, text: str, parent_node: BaseNode, metadata: Dict[str,str] | None = None) -> TextNode:
        node = TextNode(
            text=text,
            excluded_embed_metadata_keys=(parent_node.excluded_embed_metadata_keys + self.exclude_metadata_fields),
            excluded_llm_metadata_keys=(parent_node.excluded_llm_metadata_keys + self.exclude_metadata_fields),
            relationships={NodeRelationship.SOURCE: parent_node.as_related_node_info()},
        )

        if metadata:
            node.metadata = {**node.metadata, **metadata}
        
        return node

On not giving legal advice

So it works pretty well, but, as with most LLMs at the moment, maybe not well enough to be relied upon entirely.

But legal advice is quite important and serious, which means you can’t give it without some accountability.

I am not a lawyer and I don’t want to be. I think you could convey very well to users that THIS IS NOT LEGAL ADVICE. But even if you do, this project doesn’t really go anywhere from there.

Having to dive into the legislation on a particular problem you’re having is hardcore and not something most people will ever need to do. Organisations like the Citizens Advice Bureau have great FAQs on their sites that convey plenty of information in an understandable way and cover most people’s questions.

What’s next?

I’ve got some other LLM projects I want to work on now. They’re a bit more product focused, so hopefully they’ll turn out well.

Big thanks to Georgia W. for answering my questions about law stuff.


Plus, AI and LLMs are pretty important skills to pick up these days.