examples: add langchain-chroma example (#248)

2024-06-07 19:40:48 +00:00 · 2023-05-12 22:20:07 +02:00 · 2023-05-12 22:20:07 +02:00 · 557ccc5ad8
commit 557ccc5ad8
parent 2488c445b6
9 changed files with 152 additions and 1 deletion
--- a/examples/README.md
+++ b/examples/README.md
@@ -65,7 +65,7 @@ Run a slack bot which lets you talk directly with a model

 [Check it out here](https://github.com/go-skynet/LocalAI/tree/master/examples/slack-bot/)

-### Question answering on documents
+### Question answering on documents with llama-index

 _by [@mudler](https://github.com/mudler)_

@@ -73,6 +73,14 @@ Shows how to integrate with [Llama-Index](https://gpt-index.readthedocs.io/en/st

 [Check it out here](https://github.com/go-skynet/LocalAI/tree/master/examples/query_data/)

+### Question answering on documents with langchain and chroma
+
+_by [@mudler](https://github.com/mudler)_
+
+Shows how to integrate with `Langchain` and `Chroma` to enable question answering on a set of documents.
+
+[Check it out here](https://github.com/go-skynet/LocalAI/tree/master/examples/langchain-chroma/)
+
 ### Template for Runpod.io

 _by [@fHachenberg](https://github.com/fHachenberg)_
--- a/examples/langchain-chroma/README.md
+++ b/examples/langchain-chroma/README.md
@@ -0,0 +1,54 @@
+# Data query example
+
+This example makes use of [langchain and chroma](https://blog.langchain.dev/langchain-chroma/) to enable question answering on a set of documents.
+
+## Setup
+
+Download the models and start the API:
+
+```bash
+# Clone LocalAI
+git clone https://github.com/go-skynet/LocalAI
+
+cd LocalAI/examples/langchain-chroma
+
+wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert
+wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
+
+# start with docker-compose
+docker-compose up -d --build
+```
+
+### Python requirements
+
+```bash
+pip install -r requirements.txt
+```
+
+### Create a storage
+
+This step creates a local vector database from the document set, so that the LLM can later answer questions about it.
+
+```bash
+export OPENAI_API_BASE=http://localhost:8080/v1
+export OPENAI_API_KEY=sk-
+
+wget https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt
+python store.py
+```
+
+After it finishes, a directory named `db` is created containing the vector index database.
+
+## Query
+
+We can now query the dataset:
+
+```bash
+export OPENAI_API_BASE=http://localhost:8080/v1
+export OPENAI_API_KEY=sk-
+
+python query.py
+# President Trump recently stated during a press conference regarding tax reform legislation that "we're getting rid of all these loopholes." He also mentioned that he wants to simplify the system further through changes such as increasing the standard deduction amount and making other adjustments aimed at reducing taxpayers' overall burden.
+```
+
+Keep in mind that, for now, the answers are hit or miss!
--- a/examples/langchain-chroma/models/completion.tmpl
+++ b/examples/langchain-chroma/models/completion.tmpl
@@ -0,0 +1 @@
+{{.Input}}
--- a/examples/langchain-chroma/models/embeddings.yaml
+++ b/examples/langchain-chroma/models/embeddings.yaml
@@ -0,0 +1,5 @@
+name: text-embedding-ada-002
+parameters:
+  model: bert
+backend: bert-embeddings
+embeddings: true
--- a/examples/langchain-chroma/models/gpt-3.5-turbo.yaml
+++ b/examples/langchain-chroma/models/gpt-3.5-turbo.yaml
@@ -0,0 +1,16 @@
+name: gpt-3.5-turbo
+parameters:
+  model: ggml-gpt4all-j
+  top_k: 80
+  temperature: 0.2
+  top_p: 0.7
+context_size: 1024
+stopwords:
+- "HUMAN:"
+- "GPT:"
+roles:
+  user: " "
+  system: " "
+template:
+  completion: completion
+  chat: gpt4all
--- a/examples/langchain-chroma/models/gpt4all.tmpl
+++ b/examples/langchain-chroma/models/gpt4all.tmpl
@@ -0,0 +1,4 @@
+The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
+### Prompt:
+{{.Input}}
+### Response:
--- a/examples/langchain-chroma/query.py
+++ b/examples/langchain-chroma/query.py
@@ -0,0 +1,31 @@
+
+import os
+from langchain.vectorstores import Chroma
+from langchain.embeddings import OpenAIEmbeddings
+from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
+from langchain.llms import OpenAI
+from langchain.chains import VectorDBQA
+from langchain.document_loaders import TextLoader
+
+base_path = os.environ.get('OPENAI_API_BASE', 'http://localhost:8080/v1')
+
+# Load and process the text
+loader = TextLoader('state_of_the_union.txt')
+documents = loader.load()
+
+text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=70)
+texts = text_splitter.split_documents(documents)
+
+# Location of the vector database that store.py persisted to disk
+# (must match the persist_directory used in store.py)
+persist_directory = 'db'
+
+embedding = OpenAIEmbeddings()
+
+# Now we can load the persisted database from disk and use it as normal.
+vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
+qa = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0, model_name="gpt-3.5-turbo", openai_api_base=base_path), chain_type="stuff", vectorstore=vectordb)
+
+query = "What did the president say about taxes?"
+print(qa.run(query))
+
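Note on `query.py`: the `chain_type="stuff"` option means the retrieved chunks are concatenated ("stuffed") into a single prompt before the LLM is called. The following is a minimal, library-free sketch of that idea, not LangChain's actual implementation. The helper names (`cosine`, `top_k`, `stuff_prompt`), the prompt wording, and the toy three-dimensional vectors are all assumptions made for illustration; the real example gets its embeddings from the `bert` model and its prompt template from LangChain.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec.

    `store` is a list of (text, vector) pairs, a stand-in for the Chroma index.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one prompt (hypothetical wording)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical toy index: hand-written vectors instead of real embeddings.
store = [
    ("Taxes were discussed at length.", [1.0, 0.0, 0.0]),
    ("The weather was sunny.", [0.0, 1.0, 0.0]),
    ("Tax reform is a priority.", [0.9, 0.1, 0.0]),
]
query_vec = [1.0, 0.05, 0.0]

chunks = top_k(query_vec, store, k=2)
prompt = stuff_prompt("What did the president say about taxes?", chunks)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the chunk size and the number of retrieved chunks together have to fit within the model's `context_size` (1024 in `gpt-3.5-turbo.yaml`).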
--- a/examples/langchain-chroma/requirements.txt
+++ b/examples/langchain-chroma/requirements.txt
@@ -0,0 +1,4 @@
+langchain==0.0.160
+openai==0.27.6
+chromadb==0.3.21
+llama-index==0.6.2
--- a/examples/langchain-chroma/store.py
+++ b/examples/langchain-chroma/store.py
@@ -0,0 +1,28 @@
+
+import os
+from langchain.vectorstores import Chroma
+from langchain.embeddings import OpenAIEmbeddings
+from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter, CharacterTextSplitter
+from langchain.llms import OpenAI
+from langchain.chains import VectorDBQA
+from langchain.document_loaders import TextLoader
+
+base_path = os.environ.get('OPENAI_API_BASE', 'http://localhost:8080/v1')
+
+# Load and process the text
+loader = TextLoader('state_of_the_union.txt')
+documents = loader.load()
+
+text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=70)
+#text_splitter = TokenTextSplitter()
+texts = text_splitter.split_documents(documents)
+
+# Embed and store the texts
+# Supplying a persist_directory will store the embeddings on disk
+persist_directory = 'db'
+
+embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
+vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)
+
+vectordb.persist()
+vectordb = None
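
Note on `store.py`: `chunk_size=300` and `chunk_overlap=70` control how the document is cut before embedding. The sketch below illustrates the effect of those two numbers with a plain fixed-size sliding window. This is a simplification and an assumption on my part: LangChain's `CharacterTextSplitter` first splits on a separator and then merges the pieces up to the size limit, so its chunk boundaries will differ. The function name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=70):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample: 1000 characters of repeating alphabet.
sample = "".join(chr(ord("a") + i % 26) for i in range(1000))
chunks = sliding_chunks(sample)

for index, chunk in enumerate(chunks):
    print(index, len(chunk))
```

A larger overlap improves the odds that a relevant passage is retrieved intact, at the cost of more chunks to embed and store.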