The Data Playbook: Making Informed Choices around LLMs


Has your boss ever shared a tweet with you saying “Hey, look! The problem we’ve been trying to solve ‘manually’ can be easily tackled with this ChatGPT prompt. Stop wasting your time, use ChatGPT!”?

Well, this is a guide for when it’s time to have that conversation. If you’re a data person in 2023, it won’t be long before that happens.

LLMs are wonderful creations, capable of so much more than we currently understand, especially in the conversational AI space. They perform well on common NLP problems such as information extraction, classification, and question answering. However, their best NLP use is probably additive to the current state of data science & engineering, rather than a complete replacement for it.

Why might an in-house solution of data transformation, cleansing, and multiple models be better than simply using an LLM?

That is because when building a solution, we consider multiple factors, notably: accuracy, explainability, latency, and cost. LLMs might be accurate, but they are not so great at the rest. Let’s explore how LLMs measure up against those criteria.

In my time at Drahim (YC22), my team and I built an enrichment engine that classifies transactions with 97% accuracy (measured against class updates from active users).

We tackled this problem with four sequential machine learning pipelines, pre- and post-processing, and efficient use and structure of database tables. Let’s see how well LLMs compare on this task.
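To give a feel for what a staged setup like that looks like, here is a simplified, hypothetical sketch. This is not Drahim’s actual architecture; the stages, merchants, and confidence threshold are made up for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Prediction:
    label: Optional[str]
    confidence: float

def merchant_lookup(text: str) -> Prediction:
    # Stage 1: match against a curated merchant -> category table.
    known = {"alnahdi": "health", "mcdonalds": "food"}
    for merchant, label in known.items():
        if merchant in text.lower():
            return Prediction(label, 1.0)
    return Prediction(None, 0.0)

def ml_classifier(text: str) -> Prediction:
    # Stage 2 (stubbed): a trained model would score the harder cases.
    return Prediction(None, 0.0)

def classify(text: str, threshold: float = 0.9) -> str:
    # Each stage handles what it is confident about and defers the rest.
    for stage in (merchant_lookup, ml_classifier):
        prediction = stage(text)
        if prediction.label and prediction.confidence >= threshold:
            return prediction.label
    return "uncategorized"

print(classify("POS PURCHASE ALNAHDI 123 RIYADH"))  # -> health
```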

Accuracy & Fit For Use 🎯

First, can LLMs solve the NLP problem we are facing?

Let’s validate that. I asked ChatGPT to classify a transaction into one of three classes: travel, health, or food. Check this example:

[Image: ChatGPT correctly classifying a transaction]

ChatGPT does classify it correctly. However, LLMs may look like they can solve a problem based on one example, but that example might be cherry-picked 🍒. Before trying a full test set of transactions, ask yourself: is this example representative of my data set? Is it special in some way that made it easy for the model to predict the answer?

Let’s try another example, this time without the ‘hospital’ keyword, a better reflection of the real-life data the model is expected to classify in production (Alnahdi is a popular pharmacy in Saudi Arabia):

[Image: ChatGPT incorrectly classifying a transaction]

The LLM failed on the second example, understandably (the correct class is health). From my experience working with merchant names in the real world, half of the time the merchant name alone is not indicative (even to humans!) of the transaction’s class. That’s one reason transaction classification is a tough problem, tackled by companies that specialize in solving it (e.g. Ntropy).
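To move beyond one-off chat examples, you can script the same prompt over a sample of your data. Here is a minimal sketch using the `openai` Python library as it looked in 2023; the model choice and prompt wording are my own assumptions:

```python
import os
import openai  # pip install openai (the 0.x-era API, current as of 2023)

openai.api_key = os.environ["OPENAI_API_KEY"]

CLASSES = ["travel", "health", "food"]

def classify_with_llm(transaction: str) -> str:
    # Ask the model to pick exactly one of our classes.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Classify this bank transaction into one of {CLASSES}. "
                f"Reply with the class name only.\n\n{transaction}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Run over a small but representative sample, not one hand-picked example.
for tx in ["POS PURCHASE ALNAHDI 123 RIYADH", "UBER TRIP HELP.UBER.COM"]:
    print(tx, "->", classify_with_llm(tx))
```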

However, if you try a couple of examples from your data set and the LLM classifies them correctly, go further and ask: can it classify into dynamic classes? Can it classify into your business-specific classes, with their own rules and nuances? So far, you have validated the business fit of the LLM, its fitness for use. If the LLM passes these checks, the next step is of course validating it with model evaluation metrics on your data set: measuring how many examples it classifies correctly, its F1 score, and so on. In other words, measuring its accuracy.
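Once you have a labeled sample, the evaluation itself is standard. A minimal sketch with scikit-learn, reusing the `classify_with_llm` helper from above (the labels here are illustrative):

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Labeled test set: (transaction text, true class). Illustrative data only.
test_set = [
    ("POS PURCHASE ALNAHDI 123 RIYADH", "health"),
    ("UBER TRIP HELP.UBER.COM", "travel"),
    ("MCDONALDS 4321 JEDDAH", "food"),
]

y_true = [label for _, label in test_set]
y_pred = [classify_with_llm(text) for text, _ in test_set]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))
```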

Second, are LLMs better than our efforts so far?

After validating the LLM’s performance, benchmark the results against your current pipeline. If the current offering outperforms the LLM, the answer to replacing the pipeline with an LLM is an easy no, because it simply does not solve your problem as well. However, if it does solve the problem, and matches or slightly exceeds the results of your in-house work, that’s when the conversation is a bit harder to have.

Say you’ve found that LLMs are a really great fit for your predictive task. Now what?

Explainability 🔍

Explainability is important for many reasons, especially in high-stakes systems; it helps ensure fair, unbiased predictions, and it lets you debug a failed prediction and tune the model accordingly. Hosted LLMs are not explainable: there is no way to reason about why the model predicted the answer it gave, as opposed to in-house models or statistical methods, where interpretability is within reach.
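For contrast, here is the kind of explainability an in-house model gives you: a linear classifier over TF-IDF features exposes exactly which tokens pushed a prediction toward each class. A minimal scikit-learn sketch with illustrative training data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["alnahdi pharmacy", "hospital visit", "mcdonalds", "uber trip"]
labels = ["health", "health", "food", "travel"]

vec = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(texts), labels)

# Inspect which tokens push a prediction toward each class: this is the
# kind of debugging a hosted LLM cannot offer.
features = vec.get_feature_names_out()
for cls, coefs in zip(clf.classes_, clf.coef_):
    top = np.argsort(coefs)[-2:]
    print(cls, "<-", [features[i] for i in top])
```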

Let’s say you’ve found an ML technique to provide some interpretability to your LLM’s predictive task. Then what?

Latency ⏳

A system’s latency is the time it takes to output a response; in this case, the time it takes to classify a transaction. Latency is an issue whenever you rely on a provider to request and respond over HTTP. Your system’s latency now depends on their load, network, availability, geographical location, etc. You might have outsourced the complexity of the pipeline, but you have also introduced an external dependency into your system. This is a key consideration for any API service, not only hosted LLMs. An advantage of hosting your own models, or owning the computation, is owning the network: when you can deploy the model and its consumer (service) on the same premises or cloud, you can optimize for latency when needed.
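If latency matters for your feature, measure the round trip directly rather than guessing. A minimal sketch timing repeated calls; the endpoint and payload are placeholders:

```python
import statistics
import time

import requests  # pip install requests

def time_request(url: str, payload: dict, timeout: float = 10.0) -> float:
    # One classification round trip, including network time.
    start = time.perf_counter()
    requests.post(url, json=payload, timeout=timeout)
    return time.perf_counter() - start

# Placeholder endpoint: substitute your provider's API or your own service.
URL = "https://api.example.com/v1/classify"
latencies = [time_request(URL, {"text": "ALNAHDI 123 RIYADH"}) for _ in range(20)]

print(f"p50: {statistics.median(latencies) * 1000:.0f} ms")
print(f"p95: {sorted(latencies)[int(0.95 * len(latencies))] * 1000:.0f} ms")
```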

Now say that OpenAI’s API has been overhauled and the extra latency is insignificant. Then what?

Cost 💸

Since you’re using a hosted service, it comes at a cost, and that cost is a major deterrent from using LLMs. You are charged per token, which can accumulate quickly, and you might have to restrict the number of tokens per request.
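A quick back-of-envelope calculation makes this concrete. A sketch assuming gpt-3.5-turbo’s 2023 price of roughly $0.002 per 1K tokens; check your provider’s current rates and plug in your own volumes:

```python
# Rough monthly cost estimate for per-token pricing. Numbers are assumptions.
PRICE_PER_1K_TOKENS = 0.002   # USD, gpt-3.5-turbo circa mid-2023
TOKENS_PER_REQUEST = 150      # prompt + completion, unoptimized
REQUESTS_PER_DAY = 100_000    # e.g. transactions classified daily

monthly_cost = (REQUESTS_PER_DAY * 30 * TOKENS_PER_REQUEST / 1000
                * PRICE_PER_1K_TOKENS)
print(f"~${monthly_cost:,.0f}/month")  # ~$900/month at these assumptions
```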

Another factor in how costly they can get is misuse, especially by beginners. In the example above of classifying a transaction, I did not need to include the entire transaction in the prompt. In a regular pipeline, cleansing is normally a step: only the key information relating to the class would be extracted and used for prediction. In this example, that’s the merchant’s name.
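You can measure exactly how much that cleansing step saves by counting tokens. A minimal sketch using OpenAI’s tiktoken library; the raw transaction string is made up:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

raw = "2023-05-18 POS PURCHASE CARD 4XXX-XXXX ALNAHDI 123 RIYADH SA SAR 42.50 REF 000981"
cleansed = "ALNAHDI"  # only the class-relevant field: the merchant name

print(len(enc.encode(raw)), "tokens for the whole transaction")
print(len(enc.encode(cleansed)), "tokens after cleansing")
```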

The cost of LLMs can sometimes be avoided entirely when they are overkill, such as in some information extraction problems. In those problems, cleansing scripts and information templates (sometimes as unintelligent as regexes) are usually a key step in extracting information. When those templates fail, the processed data is passed on to more task-specific models.
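As a concrete illustration, here is what such a template-then-fallback extraction step might look like; the pattern, bank format, and fallback are hypothetical:

```python
import re
from typing import Optional

# An "unintelligent" template: pull the merchant out of a known bank format.
MERCHANT_PATTERN = re.compile(
    r"POS PURCHASE(?: CARD \S+)? (?P<merchant>[A-Z][A-Z ]+?) \d"
)

def template_extract(raw: str) -> Optional[str]:
    match = MERCHANT_PATTERN.search(raw)
    return match.group("merchant").strip() if match else None

def model_extract(raw: str) -> str:
    # Placeholder: a task-specific model, used only when the template fails.
    raise NotImplementedError

def extract_merchant(raw: str) -> str:
    merchant = template_extract(raw)  # cheap, deterministic first pass
    return merchant if merchant is not None else model_extract(raw)

print(extract_merchant("POS PURCHASE ALNAHDI 123 RIYADH"))  # -> ALNAHDI
```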

This misuse and overuse accumulates cost quickly, and it also jeopardizes possibly sensitive data.

Privacy 🔒

Finally: you might need to come up with clever ways to hide sensitive data from your LLM provider, and factor that into your development time (keep an eye out for my next post covering how to tackle this issue).
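One common approach is to mask sensitive fields before the text ever leaves your system. A minimal, illustrative sketch; the patterns are my assumptions, and real redaction needs far more care:

```python
import re

# Mask obviously sensitive fields before sending text to a hosted LLM.
REDACTIONS = [
    (re.compile(r"\bSA\d{22}\b"), "[IBAN]"),   # Saudi IBAN: "SA" + 22 digits
    (re.compile(r"\b\d{9,}\b"), "[NUMBER]"),   # long digit runs: accounts, phones
]

def redact(text: str) -> str:
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Transfer to SA0380000000608010167519 from 0551234567"))
# -> "Transfer to [IBAN] from [NUMBER]"
```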

Takeaway 🥡

LLMs may be able to solve the problem just as well as your pipeline, and using them is a simple, straightforward implementation. All the more enticing to replace all that scaffolding with a request to ChatGPT. However, they might not stand up as well to the other system expectations.

LLMs are powerful, and they do make sense for NLP-intense features such as conversations and language generation. For the rest of NLP problems, though, it might make sense to invest the time and effort in building your own use-case-specific solutions. My team at Drahim did end up using LLMs in production, but to tackle a completely different problem: conversational features 💬.

In summary, when deciding whether to use an LLM to solve a problem, consider:

  • Accuracy (fit for use)
  • Explainability
  • Latency
  • Cost
  • Privacy


These are my 2 cents, from someone who has built in-house models, as well as used hosted models, pretrained models, and LLMs in production.

Tweet at me for any questions, corrections, or comments: @Anfal_Alatawi


Photo by Alina Grubnyak on Unsplash

Written on May 21, 2023