llama_index: [Bug]: LLMPredictor cache argument

Bug Description

I want to pass a GPTCache object as a cache so that LlamaIndex uses it, but LLMPredictor does not accept a cache argument. How can caching be done now?

index = GPTVectorStoreIndex.from_documents(
    documents,
    service_context=ServiceContext.from_defaults(
        llm_predictor=LLMPredictor(cache=gptcache_obj)
    ),
)

Passing cache=gptcache_obj throws an error because cache is not a valid argument.

Version

latest

Steps to Reproduce

Try to integrate LlamaIndex with GPTCache:

import hashlib

from gptcache import Cache
from gptcache.adapter.api import init_similar_cache

# Assumed import: langchain's GPTCache cache wrapper, whose constructor takes an init function
from langchain.cache import GPTCache


def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")


gptcache_obj = GPTCache(init_gptcache)

from llama_index import (
    GPTVectorStoreIndex,
    ServiceContext,
    LLMPredictor,
    SimpleWebPageReader,
)

loader = SimpleWebPageReader(html_to_text=True)
documents = loader.load_data(urls=["https://milvus.io/docs/overview.md"])

index = GPTVectorStoreIndex.from_documents(
    documents,
    service_context=ServiceContext.from_defaults(
        llm_predictor=LLMPredictor(cache=gptcache_obj)
    ),
)
query_engine = index.as_query_engine()

Relevant Logs/Tracebacks

No response

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 15

Most upvoted comments

🤖

Hello,

Thank you for reaching out. It seems like you’re trying to integrate GPTCache with LlamaIndex. However, as you’ve noticed, the LLMPredictor class does not currently accept a cache argument.

Based on similar issues that have been resolved in the past, it appears that caching can be achieved by setting the cache on the langchain module if you are using langchain LLM classes. Here is a sample code snippet from a previous issue:

import langchain
from langchain.cache import InMemoryCache

langchain.llm_cache = InMemoryCache()

from langchain.chat_models import ChatOpenAI
from llama_index import ListIndex, Document, ServiceContext

llm = ChatOpenAI(model_name="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)
index = ListIndex.from_documents([Document(text="Hello World")], service_context=service_context)
index.as_query_engine().query("Hello?")
index.as_query_engine().query("Hello?")

In this example, the second query will return instantly due to the caching mechanism. This feature should be added to the native LLMs at some point too.
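
If you want to plug your GPTCache object in specifically rather than InMemoryCache, a minimal sketch along the same lines (assuming langchain's GPTCache cache wrapper and the init_gptcache function from your reproduction; gpt-3.5-turbo is just an example model) would be:

import langchain
from langchain.cache import GPTCache
from langchain.chat_models import ChatOpenAI
from llama_index import ServiceContext

# Route every langchain LLM call through the GPTCache similarity cache
langchain.llm_cache = GPTCache(init_gptcache)

llm = ChatOpenAI(model_name="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)
# Build the index and query engine with this service_context, as in your snippet

Because this relies on langchain's global llm_cache, it only takes effect when the LLM passed to ServiceContext is a langchain LLM class.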

Please note that this solution is based on the langchain module, and it might not be directly applicable if you’re not using langchain LLM classes. If you’re not using langchain, could you please provide more information about your setup so I can assist you better?

I hope this helps. If you have any other questions or need further clarification, feel free to ask.

Best regards, Dosu


🤖

Based on the current architecture of the LlamaIndex data framework, caching is handled by storing parent nodes in a dictionary called parent_nodes. This dictionary acts as a cache, storing parent nodes as they are retrieved from the document store. This approach helps to avoid unnecessary database calls by reusing previously fetched parent nodes, thus improving the performance of the data framework.

If you want to implement caching in your context, you can follow a similar approach. Here’s an example of how you can modify your get_query_engine method to include caching:

from fastapi import HTTPException, status  # assumes a FastAPI setup, as in the original snippet


class QueryEngineCache:
    def __init__(self, llama_idx_svc):
        # Service that actually builds query engines (your existing LlamaIndex service)
        self.__llama_idx_svc = llama_idx_svc
        self.cache = {}

    def get_query_engine(self, tenant_id, tenant_index, tenant_config, model_name, node_postprocessors, text_qa_template, synthesizer_mode):
        cache_key = (tenant_id, tenant_index, model_name)
        if cache_key in self.cache:
            return self.cache[cache_key]

        query_engine = self.__llama_idx_svc.get_query_engine(tenant_id,
                                                             tenant_index,
                                                             tenant_config,
                                                             model_name,
                                                             node_postprocessors=node_postprocessors,
                                                             text_qa_template=text_qa_template,
                                                             synthesizer_mode=synthesizer_mode)

        if query_engine is None:
            raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="Unable to construct Query Engine")

        self.cache[cache_key] = query_engine
        return query_engine


query_engine_cache = QueryEngineCache(llama_idx_svc)  # llama_idx_svc: your existing service instance

And then you can use the query_engine_cache object to get the query engine:

query_engine = query_engine_cache.get_query_engine(tenant_id,
                                                   tenant_index,
                                                   tenant_config,
                                                   model_name,
                                                   node_postprocessors=node_postprocessors,
                                                   text_qa_template=text_qa_template,
                                                   synthesizer_mode=synthesizer_mode)

response = query_engine.query(query)

This way, if a query engine with the same parameters has been created before, it will be retrieved from the cache instead of being created anew. This can significantly improve the performance of your application, especially when dealing with a large number of queries.

Please note that this is a basic example and might need to be adjusted based on your specific requirements and the overall architecture of your application. For instance, you might need to implement cache eviction strategies if memory usage becomes a concern.
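
For example, a minimal LRU-style eviction could be layered on top of the dictionary above with an OrderedDict; this is only a sketch (LRUCache and max_size are illustrative names, not part of LlamaIndex):

from collections import OrderedDict


class LRUCache:
    def __init__(self, max_size=32):
        self.max_size = max_size
        self._store = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry

QueryEngineCache could then hold an LRUCache instead of a plain dict and call get/put around the query engine construction.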


🤖

Hello,

From the error messages you’re getting, it seems like the GPTCache object you’re using doesn’t support the in operator or item assignment, which are both required for your CachedLLMPredictor to work.

Based on the information provided, it’s not clear what the structure and functionality of the GPTCache object are. However, I can provide some general advice on how to proceed.

Firstly, you need to ensure that the GPTCache object supports the necessary operations. Specifically, it needs to implement the __contains__ method to support the in operator and the __setitem__ method to support item assignment. Here’s an example of what these methods might look like:

class GPTCache:
    def __contains__(self, key):
        # Return True if the cache contains the key, False otherwise
        pass

    def __setitem__(self, key, value):
        # Set the value for the key in the cache
        pass

If the GPTCache object can’t be modified to support these operations, you might need to create a wrapper class that provides the necessary interface. Here’s an example:

class CacheWrapper:
    def __init__(self, cache):
        self.cache = cache

    def __contains__(self, key):
        # Implement this method based on the interface of the cache object
        pass

    def __setitem__(self, key, value):
        # Implement this method based on the interface of the cache object
        pass
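
For illustration only, the stubs above could be filled in with GPTCache's adapter-level put/get helpers. This sketch assumes the keys you pass in are plain strings, and that put/get accept a cache_obj keyword to target your specific Cache instance (otherwise they fall back to GPTCache's global cache):

from gptcache import Cache
from gptcache.adapter.api import get, put


class CacheWrapper:
    def __init__(self, cache_obj: Cache):
        self.cache_obj = cache_obj

    def __contains__(self, key):
        # get() returns None on a cache miss
        return get(key, cache_obj=self.cache_obj) is not None

    def __getitem__(self, key):
        return get(key, cache_obj=self.cache_obj)

    def __setitem__(self, key, value):
        # Store the value for this key in the underlying GPTCache instance
        put(key, value, cache_obj=self.cache_obj)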

Then, you can use the CacheWrapper in your CachedLLMPredictor:

gptcache_obj = GPTCache(self.init_gptcache)
cache_wrapper = CacheWrapper(gptcache_obj)

service_ctx = ServiceContext.from_defaults(llm_predictor=CachedLLMPredictor(llm=llm, cache=cache_wrapper),
                                           embed_model=embed_model,
                                           node_parser=node_parser,
                                           callback_manager=self.__get_llm_callback_manager())

I hope this helps! If you can provide more information about the GPTCache object, I might be able to give more specific advice.


🤖

From the error message you provided, it seems like the CachedLLMPredictor class is being treated as a Pydantic model. Pydantic is a data validation library that uses Python type annotations to validate the data. It raises a ValueError when you try to assign a value to an attribute that is not defined in the model’s __init__ method.

The LLMPredictor base class you’re inheriting from is a Pydantic model, and it seems like it’s not allowing additional attributes to be added in subclasses. This is likely because the BaseComponent class that LLMPredictor inherits from is a Pydantic model with the Config class set to extra = 'forbid', which forbids extra attributes.

To resolve this issue, you can override the Config class in your CachedLLMPredictor subclass and set extra = 'allow'. This will allow extra attributes to be added to the model. Here’s how you can do it:

from typing import Any

from llama_index.llm_predictor.base import LLMPredictor
from llama_index.prompts import BasePromptTemplate
from pydantic import BaseModel


class CachedLLMPredictor(LLMPredictor):
    cache: Any = None  # Declare the cache field (default None so pydantic doesn't require it at init)

    class Config(BaseModel.Config):
        extra = 'allow'  # Allow extra attributes

    def __init__(self, cache, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache = cache

    def predict(self, prompt: BasePromptTemplate, **prompt_args: Any) -> str:
        cache_key = (prompt, tuple(sorted(prompt_args.items())))
        if cache_key in self.cache:
            return self.cache[cache_key]
        else:
            result = super().predict(prompt, **prompt_args)
            self.cache[cache_key] = result
            return result

This should allow you to add the cache attribute to the CachedLLMPredictor class without getting a ValueError. However, please note that this is a workaround and might not be the best solution if the LLMPredictor class or the BaseComponent class it inherits from is updated in the future to forbid extra attributes.
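
If relaxing the model config feels too invasive, another option is to keep the cache in a pydantic private attribute, which is exempt from field validation. This is only a sketch and assumes the pydantic v1 API that LLMPredictor used at the time (with pydantic v2 installed, PrivateAttr would come from pydantic.v1 instead), plus the same BasePromptTemplate import as above:

from typing import Any

from pydantic import PrivateAttr

from llama_index.llm_predictor.base import LLMPredictor
from llama_index.prompts import BasePromptTemplate


class CachedLLMPredictor(LLMPredictor):
    # Private attributes are not model fields, so no Config override is needed
    _cache: Any = PrivateAttr()

    def __init__(self, cache, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cache = cache

    def predict(self, prompt: BasePromptTemplate, **prompt_args: Any) -> str:
        cache_key = (prompt, tuple(sorted(prompt_args.items())))
        if cache_key in self._cache:
            return self._cache[cache_key]
        result = super().predict(prompt, **prompt_args)
        self._cache[cache_key] = result
        return result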


🤖

Hello,

Based on the issues you’ve mentioned and the code you’ve provided, it seems you’re trying to integrate GPTCache with LlamaIndex using OpenAI classes. As you’ve noted, the LLMPredictor class does not currently accept a cache argument. However, you can create a subclass of LLMPredictor and override the methods where you want to use the cache.

Here’s an example of how you can do this:

from typing import Any

from llama_index.llm_predictor.base import LLMPredictor
from llama_index.prompts import BasePromptTemplate


class CachedLLMPredictor(LLMPredictor):
    def __init__(self, cache, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache = cache

    def predict(self, prompt: BasePromptTemplate, **prompt_args: Any) -> str:
        cache_key = (prompt, tuple(sorted(prompt_args.items())))
        if cache_key in self.cache:
            return self.cache[cache_key]
        else:
            result = super().predict(prompt, **prompt_args)
            self.cache[cache_key] = result
            return result

In this example, CachedLLMPredictor is a subclass of LLMPredictor that accepts an additional cache argument in its constructor. The predict method is overridden to first check if the result is in the cache before calling the superclass’s predict method. If the result is not in the cache, it is computed, stored in the cache, and then returned.

You can then use this CachedLLMPredictor in your ServiceContext:

cache = {}  # or your preferred caching mechanism
llm_predictor = CachedLLMPredictor(llm=llm, cache=cache)

service_ctx = ServiceContext.from_defaults(llm_predictor=llm_predictor,
                                           embed_model=embed_model,
                                           node_parser=node_parser,
                                           callback_manager=self.__get_llm_callback_manager())

Please note that this is a simple example and may not cover all your needs. For example, it assumes that the cache is a dictionary-like object and that the prompt and prompt_args can be used as dictionary keys. You may need to adjust this code to fit your specific requirements.
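
For instance, if your cache only accepts string keys (as a GPTCache-backed wrapper likely would), one way to derive a stable key is to serialize the prompt and its arguments. make_cache_key below is a hypothetical helper, and the getattr on template is only an assumption about how your prompt type exposes its text:

import hashlib
import json


def make_cache_key(prompt, prompt_args):
    # Hash the prompt's template text plus the sorted arguments into a stable string key
    template_text = getattr(prompt, "template", str(prompt))
    payload = json.dumps({"template": template_text, "args": prompt_args}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()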

I hope this helps! If you have any further questions, please don’t hesitate to ask.

Best, Dosu


🤖

To implement semantic caching using GPT cache, you would need to modify the QueryEngine class and the QueryEngineCache class.

Firstly, you would need to add a cache to the QueryEngine class. This cache would store the results of previous queries. When a new query is made, the QueryEngine would first check if the query or a semantically similar query is in the cache. If it is, the QueryEngine would return the cached result instead of performing a new query.

Here is a basic implementation of this idea:

class QueryEngine:
    def __init__(self, ...):
        ...
        self.semantic_cache = {}

    def query(self, query):
        # Check if the query or a semantically similar query is in the cache
        for cached_query, cached_result in self.semantic_cache.items():
            if self.is_semantically_similar(query, cached_query):
                return cached_result

        # If the query is not in the cache, perform the query and store the result in the cache
        result = self.perform_query(query)
        self.semantic_cache[query] = result
        return result

    def is_semantically_similar(self, query1, query2):
        # Implement this method to check if two queries are semantically similar
        ...

In the QueryEngineCache class, you would need to modify the get_query_engine method to pass the semantic_cache to the QueryEngine when it is created:

class QueryEngineCache:
    def get_query_engine(self, tenant_id, tenant_index, tenant_config, model_name, node_postprocessors, text_qa_template, synthesizer_mode):
        cache_key = (tenant_id, tenant_index, model_name)
        if cache_key in self.cache:
            return self.cache[cache_key]

        semantic_cache = {}  # Initialize the semantic cache
        query_engine = self.__llama_idx_svc.get_query_engine(tenant_id,
                                                             tenant_index,
                                                             tenant_config,
                                                             model_name,
                                                             node_postprocessors=node_postprocessors,
                                                             text_qa_template=text_qa_template,
                                                             synthesizer_mode=synthesizer_mode,
                                                             semantic_cache=semantic_cache)

        ...

Please note that this is a basic implementation and may not be optimal for your use case. The is_semantically_similar method, in particular, could be improved by using a more sophisticated method for comparing the semantic similarity of two queries, such as using a language model like GPT-3.

Also, please note that this implementation assumes that the QueryEngine class and the __llama_idx_svc.get_query_engine method can be modified to accept a semantic_cache parameter. If this is not the case, you would need to find another way to pass the semantic_cache to the QueryEngine.
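
As one rough illustration, is_semantically_similar could be approximated with embeddings and cosine similarity. The sketch below assumes llama_index's OpenAIEmbedding and an arbitrary 0.9 threshold; any embedding model and a threshold tuned to your data would do:

import numpy as np

from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()


def is_semantically_similar(query1: str, query2: str, threshold: float = 0.9) -> bool:
    # Embed both queries and compare them with cosine similarity
    v1 = np.array(embed_model.get_text_embedding(query1))
    v2 = np.array(embed_model.get_text_embedding(query2))
    cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return cosine >= threshold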


Well, I think the best option would be to support a cache argument directly, the way it is described in the GPTCache documentation. Is there any plan to add this in a future release?