CharacterTextSplitter applies chunk_size and chunk_overlap only when it merges the pieces produced by splitting on a single separator (by default "\n\n"). It never splits inside a piece, so text that does not contain the separator comes back as one chunk, however long it is. As a result, chunks can exceed the specified chunk_size and show no overlap at all.
On the other hand, RecursiveCharacterTextSplitter tries a list of separators in turn (by default "\n\n", "\n", " ", and ""). Its split_text method recursively splits any piece that is still too long using the next separator, until each piece fits within chunk_size, and then merges the pieces with the requested overlap. This approach is more closely aligned with the intention of creating text chunks of a specific size and overlap.
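To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is a simplified illustration and not langchain's actual implementation: the names merge_pieces, split_on_one_separator and split_recursively are hypothetical, and the final "" (character-level) separator is left out, so an oversized piece with no separators left is returned unchanged.

```python
def merge_pieces(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters."""
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only a tail of at most chunk_overlap characters as the overlap.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
            # Drop more of the tail if the new piece still would not fit.
            while current and len(separator.join(current + [piece])) > chunk_size:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


def split_on_one_separator(text, separator="\n\n", chunk_size=30, chunk_overlap=10):
    """Single-separator strategy: a piece is never split any further."""
    pieces = [p for p in text.split(separator) if p]
    return merge_pieces(pieces, separator, chunk_size, chunk_overlap)


def split_recursively(text, separators=("\n\n", "\n", " "), chunk_size=30, chunk_overlap=10):
    """Recursive strategy: fall back to finer separators for oversized text."""
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, rest = separators[0], separators[1:]
    if separator not in text:
        return split_recursively(text, rest, chunk_size, chunk_overlap)
    pieces = []
    for part in text.split(separator):
        if not part:
            continue
        if len(part) > chunk_size:
            # Still too long: split this part with the remaining separators.
            pieces.extend(split_recursively(part, rest, chunk_size, chunk_overlap))
        else:
            pieces.append(part)
    return merge_pieces(pieces, separator, chunk_size, chunk_overlap)


text = "Just a test document to assess splitting/chunking"
print(split_on_one_separator(text))
# ['Just a test document to assess splitting/chunking']
print(split_recursively(text))
# ['Just a test document to assess', 'to assess splitting/chunking']
```

The sample text contains no "\n\n", so the single-separator version has only one piece to merge and returns it whole, while the recursive version falls back to splitting on spaces and can honor both limits.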
In [ ]:
from langchain.schema.document import Document
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
In [ ]:
doc1 = Document(page_content="Just a test document to assess splitting/chunking")
doc2 = Document(page_content="Short doc")
docs = [doc1, doc2]
docs
Out[ ]:
[Document(page_content='Just a test document to assess splitting/chunking'), Document(page_content='Short doc')]
In [ ]:
text_splitter_c = CharacterTextSplitter(chunk_size=30, chunk_overlap=10)
texts_c = text_splitter_c.split_documents(docs)
texts_c
# Note: nothing was split and there is no overlap, despite chunk_size=30 and chunk_overlap=10
Out[ ]:
[Document(page_content='Just a test document to assess splitting/chunking'), Document(page_content='Short doc')]
In [ ]:
text_splitter_rc = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=10)
texts_rc = text_splitter_rc.split_documents(docs)
texts_rc
Out[ ]:
[Document(page_content='Just a test document to assess'), Document(page_content='to assess splitting/chunking'), Document(page_content='Short doc')]
In [ ]:
# Length of the longest chunk produced by each splitter
max_chunk_c = max(len(x.page_content) for x in texts_c)
max_chunk_rc = max(len(x.page_content) for x in texts_rc)
print(f"Max chunk in CharacterTextSplitter output is of length {max_chunk_c}")
print(f"Max chunk in RecursiveCharacterTextSplitter output is of length {max_chunk_rc}")
Max chunk in CharacterTextSplitter output is of length 49
Max chunk in RecursiveCharacterTextSplitter output is of length 30
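The maximum length confirms the size limit, but not the overlap. As a rough check, the overlap between two consecutive chunks can be measured directly on the strings. The helper below, shared_overlap, is a hypothetical name introduced for illustration and is not part of langchain; the chunk strings are copied from the RecursiveCharacterTextSplitter output above.

```python
def shared_overlap(previous: str, following: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `following`."""
    for size in range(min(len(previous), len(following)), 0, -1):
        if previous.endswith(following[:size]):
            return following[:size]
    return ""


chunks = ["Just a test document to assess", "to assess splitting/chunking"]
overlap = shared_overlap(chunks[0], chunks[1])
print(repr(overlap), len(overlap))  # 'to assess' 9
```

The shared text is 9 characters long, which stays within the requested chunk_overlap of 10. The overlap falls on word boundaries because the split happened on spaces.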