Jobs for LLMs and how to survive the robotic uprising
While absolutely mind-blowing, LLMs are a far cry from real human beings. There are no silver bullets, and like any technology, it's time to review their strengths and weaknesses.
About two years ago, some circles started getting seriously heated about a “GPT” thing, so I managed to get myself invited to try it out. I was expecting something mildly better than other moderately impressive tech of the time, like BERT or what DeepPavlov could do. Instead, I've ended up using it every single day since, for everything from planning my holidays to writing code to replying to administrative emails for me.
And while utterly neutral research centers like “Microsoft Research”1 wrote the now-famous “Sparks of AGI” paper claiming that most jobs will be replaced by AI, we can also see studies from MIT (and IBM) wondering whether AI is such a cost-effective solution. Turns out that maybe… not so much.
This begs the question: which tasks is it smart to outsource to LLMs, and which is it not?
Transforming stuff
You’ve probably already read how LLMs work: how they use the transformer architecture and huge amounts of data to train hundreds of billions of parameters. So that’s not what we’re going to talk about.
LLMs are essentially good at taking a text, distilling it into its raw essence, transforming it, and then outputting it in a different form. For example, I asked GPT-4:
Please parse the following text into semantic JSON, and output only JSON: Yeah so i've been grocery shopping and there was no more fresh milk but still some regular milk so I took 1L of this for 2€ and then I wanted to buy cereals but I figured it's not good for my diet so I didn't and I got 3kg of broccoli for 5€
The answer is:
{
"activities": [
{
"activity": "grocery shopping",
"details": [
{
"item": "milk",
"type": "regular",
"quantity": "1L",
"price": "2€"
},
{
"item": "cereals",
"decision": "not purchased",
"reason": "not good for diet"
},
{
"item": "broccoli",
"quantity": "3kg",
"price": "5€"
}
]
}
]
}
You can see exactly the same story translated into JSON. And it makes sense!
The same goes for translating into other languages and registers. For example, I asked GPT-4 to translate “You fucking suck” in different ways; here’s what I got:
Polite French: Votre performance est très insuffisante.
Slang French: Tu es nul
Legalese: The manner in which you perform is utterly unsatisfactory.
Obviously it was a bit challenging to get a response as those LLMs all refuse to be offensive, but you get the idea.
Overall, you’ll see that LLMs are extremely effective at translating and parsing things.
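One practical wrinkle with this kind of parsing: the model's reply sometimes wraps the JSON in Markdown fences or adds a sentence around it. Here's a minimal defensive extraction sketch; the assumption that replies come back fenced like this is mine, not a guarantee from any API:

```python
import json
import re

def extract_json(reply: str):
    """Pull the first JSON object out of an LLM reply.

    Models often wrap JSON in ```json fences or add prose around it,
    so we look for the outermost braces rather than trusting the raw reply.
    """
    # Strip Markdown code fences if present
    reply = re.sub(r"```(?:json)?", "", reply)
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in reply")
    return json.loads(reply[start:end + 1])

# Example with a reply shaped like the grocery answer above
reply = """Here is the parsed result:
```json
{"activities": [{"activity": "grocery shopping"}]}
```"""
data = extract_json(reply)
```

Asking for “only JSON” in the prompt helps, but a fallback like this keeps the occasional chatty reply from crashing your pipeline.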
Shortening stuff
If I paste my previous article into GPT-4 and ask:
Please give me as short bullet points the startup ideas identified in the following article:
Here is the response:
Custom-fit 3D printed shoes as a solution to the one-size-fits-all problem in footwear, providing perfect fit and comfort.
Use of advanced 3D scanning technology for accurate foot measurements, integrating these scanners into smartphones for widespread accessibility.
Parametric modeling software development to adapt shoe designs to individual foot scans, enabling mass customization in shoe production.
TPE (Thermoplastic Elastomer) printing for footwear, addressing manufacturing challenges to scale up production and reduce costs.
Potential market disruption with bespoke shoes at a lower price point, challenging traditional luxury and bespoke shoe markets.
Opportunities for startups in areas like feet measurement apps, parametric modeling tools for shoes, and TPE printing optimization to improve efficiency and reduce production bottlenecks.
That’s not exactly how I would have phrased or summarized it, but overall it gives a good overview of the article from a specific angle. In short, the output is acceptable, but it often feels like the LLM, even when not hallucinating, starts to pour its own ideas into the mix, and that can diverge ever so slightly from what you want to say.
Classifying stuff
A common task in the AI world is classification: for a given line of data, you want to know which category it belongs to. From my personal experience, I’ve attempted to:
Classify purchases into different categories (going out, furnishing the home, etc.). Even if the input data is often pretty bad (bank statements…), there were a lot of lines consolidated from Amazon purchases or other online bills, for example. The results were less than impressive, to the point that I abandoned the project: it was way too unreliable to be used.
Given a social media post, classify it into categories like “product promotion”, “influencer collaboration”, etc. Again, not super obvious, especially since it also implied feeding the images into GPT-4 (images being way harder to analyze than text), but I ended up with a 40% accuracy, which is far from usable as well.
Overall, I’ve tried to use classification for non-obvious tasks and it failed miserably. On the other hand, if you’re looking for more classical tasks like “positive”/“negative” review sentiment, you’ll get much better results. But is it worth it, given that these tasks can be performed by much cheaper models?
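One thing that made my accuracy numbers measurable at all was forcing the model's answer into a closed label set and normalizing whatever it actually replies. A rough sketch; the label set and the "unknown" fallback are my own illustrative choices:

```python
LABELS = {"positive", "negative", "neutral"}

def build_prompt(text: str) -> str:
    # Constrain the model: one word, from a fixed vocabulary
    return (
        f"Classify the following review as one of {sorted(LABELS)}. "
        f"Answer with a single word only.\n\nReview: {text}"
    )

def normalize_label(raw: str) -> str:
    """Map a free-form model reply onto the closed label set."""
    cleaned = raw.strip().strip(".!\"'").lower()
    if cleaned in LABELS:
        return cleaned
    # The model rambled: count it as unusable rather than guessing
    return "unknown"
```

Counting "unknown" replies separately also tells you how often the model ignores your instructions, which is a useful metric in itself.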
Cleaning stuff
A few times I’ve been tempted to clean a poorly capitalized dataset and make it look nice by pushing it through an LLM. For example, a list of station names from a GTFS (public transport schedule) file that you want to polish for display purposes.
In my experiments, I got about 80% accuracy in fixing the names. It’s amazing in terms of where science has landed us, but it’s still a far cry from being accurate enough that you don’t need to cross-check the output afterwards. In the end, if you have to check everything manually anyway, you’ll get better results using Python’s title() method and reworking what you need by hand.
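For the station-name case, the pure-Python baseline looks roughly like this; the list of French connector words to keep lowercase is my own guess at what a typical GTFS feed needs:

```python
# Connector words that should stay lowercase in French station names
SMALL_WORDS = {"de", "du", "des", "la", "le", "les", "sur", "sous", "en", "et"}

def clean_station_name(name: str) -> str:
    """Recapitalize an ALL-CAPS station name for display."""
    words = name.lower().split()
    out = []
    for i, word in enumerate(words):
        # Keep connector words lowercase, except at the start of the name
        if i > 0 and word in SMALL_WORDS:
            out.append(word)
        else:
            out.append(word.capitalize())
    return " ".join(out)
```

Hyphenated names like “Saint-Lazare” still need special handling, which is exactly the kind of manual rework mentioned above.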
Saying please
I was initially fairly polite with my LLMs, partly out of habit but mostly to be spared during the robot uprising. Turns out this wasn’t the greatest idea:
When the robot uprising happens, due to the existential nature of such a conflict, it’s unlikely that a variance in your past behavior will result in a different sentence from AI justice
And even more so, we’ve now learned that LLMs can be persuaded by strong wording and authority arguments. If you ever face a robot trying to murder you, know that your best option, as with black bears, is to look sure of yourself and open negotiations.
Overall, LLMs tend to reproduce our social archetypes, and a dominant behavior will help you get better outcomes.
The cost of LLMs
As you’ve seen as a common thread throughout the different sections, the comparative cost of LLMs versus other solutions is definitely a big factor to consider. In fact, the comparative cost of different variations of the same LLM is a big topic in itself.
In case you haven’t checked GPU prices on AWS yet, the monthly cost is measured in thousands of dollars, and availability is subject to lengthy supplications to support. On the other hand, if you use OpenAI, you’re tied to their arbitrary limitations and less-than-perfect SLA.
According to the estimates from this article, the price of running a LLM goes like this:
1,000 req/day — $100/month (OpenAI), $100/month (self-hosted)
10,000 req/day — $1,000/month (OpenAI), $1,000/month (self-hosted)
100,000 req/day — $10,000/month (OpenAI), $2,000/month (self-hosted)
1,000,000 req/day — $100,000/month (OpenAI), $5,000/month (self-hosted)
You can clearly see that at low request volumes the SaaS is the better option, while at scale you can self-host. This will also depend on your ability to recruit staff to manage those servers (which is far from trivial) and your needs for custom SLAs, data privacy, and other considerations.
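Plugging the article's estimates into a quick break-even check makes the crossover point explicit (the figures are the ones quoted above, not mine):

```python
# Monthly cost estimates quoted above: req/day -> (OpenAI, self-hosted)
COSTS = {
    1_000: (100, 100),
    10_000: (1_000, 1_000),
    100_000: (10_000, 2_000),
    1_000_000: (100_000, 5_000),
}

def cheaper_option(req_per_day: int) -> str:
    """Return which hosting option is cheaper at a given volume."""
    openai_cost, self_hosted_cost = COSTS[req_per_day]
    if openai_cost == self_hosted_cost:
        return "either"
    return "self-hosted" if self_hosted_cost < openai_cost else "OpenAI"
```

The estimates put the break-even somewhere between 10,000 and 100,000 requests per day, which is exactly where staffing and SLA considerations should enter the decision.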
As a side note, OpenAI’s models are closed (as hinted by the name of the company2), so you cannot self-host them. Right now your best bet is Mistral, which is only marginally less efficient than GPT-4; if you stick to the tasks that LLMs champion, such as those covered in this article, you should not see any significant difference.
Overall, the cost can be pretty steep so you really need to consider your alternatives before resorting to using LLMs.
Real-world use cases
Now that we’ve covered the kinds of tasks that LLMs can perform efficiently in terms of both accuracy and cost, let’s review real-world use cases that make sense. Far be it from me to say that other scenarios don’t exist, or that LLMs will necessarily perform poorly outside of these possibilities. Let’s just say that, in my experience, these use cases work.
FAQ-style chatbot
It will be no surprise to anyone, given that I’m a founder of ChatFAQ, but basically LLMs are great when used following a RAG model:
A question is asked
We use embeddings to find an answer to that question within the knowledge DB
Then we use an LLM to extract the interesting bits from the knowledge DB and form a concise answer
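The three steps above can be sketched with a toy retrieval stage. To keep the example standalone I fake the embeddings with plain word counts; a real RAG pipeline would use an actual embedding model, and the knowledge base entries here are invented:

```python
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented knowledge base entries
KNOWLEDGE_DB = {
    "refund-policy": "You can request a refund within 30 days of purchase.",
    "shipping-times": "Orders ship within two business days via standard mail.",
}

def retrieve(question: str) -> str:
    # Step 2: find the most relevant entry for the question
    q = embed(question)
    return max(KNOWLEDGE_DB, key=lambda k: cosine(q, embed(KNOWLEDGE_DB[k])))

# Step 3 would feed the retrieved text to the LLM:
best = retrieve("Can I request a refund within 30 days?")
prompt = (
    f"Using only this context: {KNOWLEDGE_DB[best]}\n"
    f"Answer the question concisely: Can I request a refund within 30 days?"
)
```

The bullet points make this look easy, but production retrieval (chunking, embedding quality, ranking) is where most of the real work hides.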
That’s where a framework like ChatFAQ comes in. On top of the pre-configured RAG infrastructure — which sounds easy in bullet points but really isn’t in real life — you get all the tools to manage the quality of answers, easily implement the chat widget on your page, and so forth.
Typically, that’s a use case which is ready for the world!
Email parsing
As shown in the beginning, it’s very simple to take raw, unstructured textual data and to transform it into something machine-readable according to your own specifications.
To stay with an example I already mentioned, I’ve been able to create a parser that takes all my emails and reverse-engineers every bill into JSON, with the list of purchased products and their prices.
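Once the LLM has produced its JSON, the part that actually matters is validation: prices come back as strings like "2€", and you want to reject replies with missing fields before storing anything. A sketch, with the bill schema being my own invention:

```python
import re

def parse_price(price: str) -> float:
    """Turn an LLM-formatted price like '2€' or '5.50€' into a float."""
    match = re.search(r"\d+(?:\.\d+)?", price)
    if match is None:
        raise ValueError(f"Unparseable price: {price!r}")
    return float(match.group())

def bill_total(bill: dict) -> float:
    """Sum the prices of a parsed bill, failing loudly on bad items."""
    total = 0.0
    for item in bill["items"]:
        if "item" not in item or "price" not in item:
            raise ValueError(f"Malformed item: {item!r}")
        total += parse_price(item["price"])
    return total

# Example shaped like the grocery JSON from earlier
bill = {"items": [
    {"item": "milk", "price": "2€"},
    {"item": "broccoli", "price": "5€"},
]}
```

Failing loudly on malformed output lets you re-prompt or flag the email for review instead of silently recording a wrong total.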
First layer customer support
With the ability to parse plain English sentences, you can of course use it for intent detection, catching all the most basic intents that your customer support has to deal with. By the way, ChatFAQ can help with that as well.
So imagine you run a train ticket company. I’ve made a very simple PoC demonstrating how the LLM can ask questions and extract information as YAML, so that your system can then perform the desired action automatically.
Do this for every single intent and you’ve got yourself a fully functional text-based UI for your app, which costs less than a human operator but will feel similar.
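For the train-ticket PoC, the system side boils down to parsing the model's YAML and dispatching on the intent. To keep this sketch dependency-free I parse flat key: value lines by hand (real code would use PyYAML), and the intent and field names are invented:

```python
def parse_flat_yaml(text: str) -> dict:
    """Parse flat 'key: value' lines, ignoring blanks and comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        result[key.strip()] = value.strip()
    return result

def handle(reply: str) -> str:
    """Dispatch on the intent extracted by the LLM (intent names invented)."""
    data = parse_flat_yaml(reply)
    if data.get("intent") == "book_ticket":
        return f"Booking {data['origin']} -> {data['destination']}"
    if data.get("intent") == "cancel_ticket":
        return f"Cancelling booking {data['booking_id']}"
    # Unknown intent: this is where the human operator comes back in
    return "Escalating to a human operator"

reply = "intent: book_ticket\norigin: Paris\ndestination: Lyon"
```

The fallback branch is the important design choice: anything the LLM can't confidently map to a known intent should go to a human, not be guessed at.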
Conclusion
We’ve covered some examples of tasks at which LLMs perform well, and some real-world use cases. In the end, there is a very wide range of tasks which are not appropriate for LLMs, or for which their reliability would be far too low for them to operate on their own.
But on the other hand, there are areas in which they excel and for which they can be used to optimize existing processes. All that with different implementation paths depending on the budget, privacy and sovereignty requirements.
It’s also important to highlight that, about a year after the release of GPT-4, it’s becoming quite clear that LLMs in themselves are not going to go much further in terms of capabilities, except for multi-modal upgrades. The template is laid out, and all the rage right now is about getting the best optimization out of basically identical models3. And while LLMs are definitely going to be part of the future, smarter AI will come from new techniques yet to be discovered.
To clarify for those who don’t understand sarcasm: Microsoft Research is not neutral at all, in the sense that it is deeply invested in AI and specifically in GPT-4
Still for the sarcasm-averse: the trend in the business world right now seems to be advocating for exactly what you don’t do. So “OpenAI” will naturally produce the most closed and opaque LLM there is.
I might be exaggerating this one