How Parloa Revolutionized Customer Support with Europe’s Largest GenAI Conversational FAQ
How can GenAI enable large-scale innovation in customer service? How do you evaluate the performance of a GenAI system used by thousands of customers a month, and what’s the most effective way to create safeguards that prevent hallucinations? What are the challenges in gaining customer acceptance, and winning their trust in AI conversations?
In 2023, Parloa successfully launched Europe’s largest GenAI-driven conversational FAQ service, for a multinational enterprise organization. This organization has more than 50 million customers in 30 countries, and they wanted a step change in their customer conversation service. Our innovative GenAI FAQ product has set a new standard in customer support, and we’d like to share our learnings with the ETLS community.
We’ll take a deep dive into:
- The major build hurdles we encountered, and how we addressed them
- How we measure the performance and effectiveness of the GenAI system
- The safeguards we’ve put in place to prevent hallucinations, and maintain the integrity of generated content
- How we’ve tackled significant challenges in gaining customer acceptance, focussing on concerns like data privacy and confidence in AI-generated responses
And we’ll explain how GenAI emerged as the indispensable solution for our customer’s problems, and why conversational AI is going to be so important in our industry.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
Gene Kim: Over the last 12 months, I have been so super excited at how many experience reports we have had from technology leaders as they experiment and try building new capabilities with all the things that AI is affording us these days.
Up next are Stefan Ostwald, CTO and co-founder at Parloa, and Peter Petrovics, strategic advisor at Equal Experts. They are going to present a fantastic experience of launching one of Europe's largest generative-AI-driven conversational FAQ services for a multinational enterprise with 50 million-plus customers in 30 countries.
They will talk about major hurdles they encountered, including managing and reducing hallucinations and privacy concerns, and how they overcame them.
I love this talk for many reasons, but I think it demonstrates how the job of technology leaders is not getting any easier, because we are having an ever-increasing number of functional specialties that we have to integrate to create our desired outcomes.
Stefan and Peter, I can see you. I am so delighted that you will be giving this talk.
Stefan Ostwald
Stefan Ostwald: As Gene already mentioned, we are looking today into how we can use GenAI to revolutionize customer support. We will take a look at one specific case we had with one large utility provider. It was the scenario of answering FAQ questions. We will dive into this briefly, explain what it is, and then share quite some challenges we had to overcome to really make this a success.
Before we get started, a brief introduction to what Parloa is. Parloa is an AI company to help automate contact centers. We provide a software-as-a-service solution to provide customer support on multiple channels. We are doing this for all kinds of customers: Decathlon, Barmenia, W&W, Deutsche Glasfaser, to name a couple, and doing this together with a great group of partners, among others KPMG and Microsoft.
But let us get started. Before that, a brief interruption to our presentation, because today is a great news day for us. We have a big announcement to make today: we announced our Series B.
So it is really a big day for us. We used generative AI successfully, and I think this is what was driving also that funding round. But without further ado, let us dive in.
What is the problem we really see in the world, and which we are tackling? We are in the contact center. You all might know this: you call an agent and you have a terrible experience. You wait in line when you just want your problem solved. This is where we can help.
Several companies try to use scripted dialogue or scripted FAQ, so they list questions, and then you resolve them by explaining the answer, which is kind of in the area but not really to the point of what your problem is. Again, you are not really happy.
On the other side, you have companies who are exposed to high volumes of calls, and they really struggle to get the agents to deliver on this, leading to high waiting times.
For the agents, it is ultimately not such an attractive job right now. They are kept busy with all the repetitive work, whereas they would really love to help the customers, because they are actually well educated.
What can you do? How can you get the easy stuff, the FAQ stuff, the simpler questions, out? This is what we will take a look at today, to make customer support a better experience.
This is the overall architecture we will look at. Afterwards, with Peter, we will drill into all the challenges we had.
What you see is in the beginning the customer comes and, in this case, asks, "Can I bring my dog to my flight to New York?" Let us take a look under the hood at what is happening.
First of all, I need some processing. We first play out, "Let me check your dog policy for you." Meanwhile, in the background, we create an embedding of this question: a vector representation which represents the meaning of this question.
With this embedding, we are able to look at the knowledge base of the customer to find relevant text parts which map to this question. We are looking at a vector database. We are not answering this based on the trained knowledge. This is a really important differentiation. We are using a retrieval-augmented generation approach.
We first look at the vector database: what is relevant content based on the semantic meaning of the question? Now we add this context, what is in the database, to the prompt of a large language model. The prompt is essentially the text which goes into the large language model, upon which the language model then can generate an output.
We add the context, the relevant knowledge, then we give it an instruction: what should it actually do? In this case, answer the question based on the given context. We share the question of the customer and the past conversation. Now, with this context, the large language model is able to contextually give the right answer based on the provided information of the customer.
In this case it says, "Yes, you can bring your dog on your flight, but you will be charged $75."
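The flow Stefan describes (embed the question, look up relevant passages, then assemble the prompt) can be sketched roughly as follows. This is a minimal illustration, not Parloa's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the knowledge-base passages, function names, and prompt wording are all assumptions made for the example.

```python
import math
from collections import Counter

# Made-up knowledge base; a real system would store embeddings in a vector database.
KNOWLEDGE_BASE = [
    "Checked baggage must not exceed 23 kg per passenger.",
    "You can bring your dog on your flight for a fee of $75.",
    "Online check-in opens 24 hours before departure.",
]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(word.strip(".,?!\"'").lower() for word in text.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(knowledge_base, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str], history: list[str]) -> str:
    """Assemble context, instruction, past conversation, and question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    past = "\n".join(history) if history else "(none)"
    return (
        "Answer the question based only on the given context.\n\n"
        f"Context:\n{context}\n\n"
        f"Past conversation:\n{past}\n\n"
        f"Question: {question}"
    )

question = "Can I bring my dog to my flight to New York?"
passages = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, passages, history=[])
print(prompt)  # this text would be sent to the large language model
```

The key design point from the talk is visible in `build_prompt`: the model is asked to answer from the retrieved passages rather than from its trained knowledge.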
This is the high-level overview. What we did here was to give one example for a utility provider who needed to raise their fees because they had higher prices themselves. They sent out millions of letters, and you can imagine this is a situation where people call. They want to know: is this really needed? What is the background of this? Can we get around this?
We had 20 days to implement such an AI agent to really deliver on these questions, and it was really successful. Twenty-five percent of the people who called during that time were calling because of this letter. Of all these questions which came up, 86% we were able to resolve directly. The agents were really able to focus on high-value topics, whereas these simple questions were able to be automated by our AI.
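To put the two figures together: if 25% of calls concerned the letter and 86% of those were resolved directly, the share of all calls handled by the AI agent can be estimated as their product. This reading assumes the 86% applies to letter-related calls only; the variable names are illustrative.

```python
# Figures quoted in the talk.
share_of_calls_about_letter = 0.25
share_resolved_directly = 0.86

# Assumption: the 86% is measured over letter-related calls only.
share_of_all_calls_automated = share_of_calls_about_letter * share_resolved_directly
print(f"{share_of_all_calls_automated:.1%} of all calls")
```

Under that assumption, roughly one in five of all incoming calls during the period would have been resolved without a human agent.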
From the outside, this of course looks easy, but the reality is that there were quite some challenges. This is where I hand over to Peter. Share with us a little bit more: what did you have to overcome to achieve this?
Peter Petrovics
Peter Petrovics: Thank you, Stefan. It was great. The journey was not, right? There are plenty of challenges to share. Actually, if we are looking at the time, I probably have to rush through a few things. Anyway, we picked only a few examples. I think most of the topics we will discuss here are actually each of them itself worth a presentation.
Let us look first at hallucination and confabulation. We hear a lot about LLMs doing this. I brought up a few actual real-world examples. By the way, all of these are sorted now, but I went back to our early days and put up a few actual examples.
For instance, pure nonsense. If you look at this, I did not even have an explanation. "Can I insure my dog with you?" Unfortunately, the LLM actually answered this: "Unfortunately, we only offer insurance for cats and dogs." I cannot go into details of what happened here. It was a combination of knowledge base, weird wording, and some language barriers.
Also, if you look at the two middle ones, they are making wrong assumptions based on the context, whether the result is incorrect or just too creative. For instance, nowhere in the knowledge base was it mentioned that turtles can be insured, but the LLM assumed that they can be.
These examples are why I really like to discuss what we mean by hallucination for LLMs. A lot of the time, when the LLM is creative, we are happy. If I ask, "Is there a window seat on a flight?" we are happy that the LLM knows by base knowledge that there is a window seat. But sometimes, for instance in this turtle case, we do not want it. So it is not black and white what hallucination is. I much prefer the confabulation terminology for this.
There are techniques for how we can make it better and make sure it is within the guidelines, but this is not easy. As you can see, it is not a black-and-white question.
The other big challenge in the last example on this slide is following instructions correctly. We actually said in the instructions: never refer to the context, because it just sounds weird. But it still sometimes did. We have multiple examples of instruction following, especially as the context grows and so on. Again, these are some examples of the challenges we faced.
There is no single solution, so I just picked a few things. First of all, everyone has heard about prompt engineering. I am not sure it is engineering. I call it prompt tuning. It is a little bit more trial and error, mainly because LLMs are really flaky. Even a minor paraphrase in the prompt can make a really big difference in how the LLM reacts, even with zero temperature. They are basically non-deterministic.
That leads to one kind of solution: you obviously really need to track the versions of your prompt to see what broke things, so you can always roll back and see what has happened.
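One simple way to track prompt versions is to identify each prompt by a content hash and keep an ordered history. The sketch below is an assumption about how this could look, not the tooling used in the project; `PromptRegistry` and its methods are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    """Ordered history of prompt versions, each identified by a content hash."""
    versions: list[tuple[str, str, str]] = field(default_factory=list)  # (id, prompt, note)

    def register(self, prompt: str, note: str) -> str:
        """Record a prompt and return its version id (first 12 hex chars of SHA-256)."""
        version_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
        # Skip the append when the prompt is identical to the latest version.
        if not self.versions or self.versions[-1][0] != version_id:
            self.versions.append((version_id, prompt, note))
        return version_id

    def rollback(self) -> str:
        """Drop the latest version and return the previous prompt text."""
        if len(self.versions) < 2:
            raise ValueError("no earlier version to roll back to")
        self.versions.pop()
        return self.versions[-1][1]

registry = PromptRegistry()
v1 = registry.register("Answer the question based on the given context.", "initial")
v2 = registry.register("Answer briefly, based on the given context.", "shorter answers")
restored = registry.rollback()
```

Logging the version id next to every conversation makes it possible to trace a quality drop back to the prompt change that caused it.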
Also, limiting the configurability of the prompt. If anyone can change anything in the prompt, it is very easy to break the quality and performance of the solution in a big way.
Most importantly, and I cannot emphasize this enough, different types of automated and human regression tests are required to follow and spot this, because, as you see, a minor change in one place in the prompt can break behavior in other places.
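An automated regression check can be as simple as replaying a fixed set of questions after every prompt change and flagging answers that break a known rule, such as the "never refer to the context" instruction mentioned earlier. The following is a minimal sketch; the pattern list, the `fake_generate` stand-in for the real LLM call, and the canned answers are assumptions for illustration.

```python
import re
from typing import Callable

# Phrases that would reveal the retrieved context to the caller (illustrative list).
FORBIDDEN = re.compile(r"\b(context|knowledge base|provided information)\b", re.IGNORECASE)

def violates_no_context_rule(answer: str) -> bool:
    """True when the answer refers to the context, which the instructions forbid."""
    return bool(FORBIDDEN.search(answer))

def run_regression(questions: list[str], generate: Callable[[str], str]) -> list[str]:
    """Replay the questions and return those whose answers break the rule."""
    return [q for q in questions if violates_no_context_rule(generate(q))]

# Hypothetical stand-in for the real LLM call, with canned answers.
CANNED = {
    "Can I insure my dog?": "Yes, we offer insurance for dogs.",
    "Can I insure my turtle?": "According to the context, only cats and dogs are covered.",
}

def fake_generate(question: str) -> str:
    return CANNED[question]

failures = run_regression(list(CANNED), fake_generate)
print(failures)
```

A check like this catches only one narrow class of failure, which is why the talk pairs automated tests with human review.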
One way to address this: we created a system where human evaluators can actually label the conversations. This happens while we are building and setting up these bots. Obviously, that is the key time and where most of the effort is. But we think it is really important to continuously monitor a sample of these in production, because things can drift. The knowledge base might have changed. Every change requires more effort to monitor real-world questions and real data, and then you can act and react to these.
The other big reason human labeling is needed is because automated evaluations are really challenging. That is probably a two-hour talk or a four-hour talk. Human labeling is important so we can check the correlation between the human labeling and our automated labeling: how much we can rely on the automated labeling. This gives us confidence in them. By the way, small print: you cannot 100% rely on automated labeling. But that is again a big topic.
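One common way to quantify how far automated labels can be trusted is to compare them with human labels on the same conversations, using the raw agreement rate and Cohen's kappa, which corrects for agreement expected by chance. The talk does not say which statistic was used, so this is an assumed choice, and the labels below are made up (1 = good answer, 0 = bad answer).

```python
def agreement_rate(human: list[int], auto: list[int]) -> float:
    """Fraction of conversations where both labelers agree."""
    if len(human) != len(auto) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == a for h, a in zip(human, auto)) / len(human)

def cohens_kappa(human: list[int], auto: list[int]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement_rate(human, auto)
    labels = set(human) | set(auto)
    # Chance agreement: probability both pick the same label independently.
    p_e = sum((human.count(l) / n) * (auto.count(l) / n) for l in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

# Made-up labels for eight conversations.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1]
auto_labels = [1, 1, 0, 0, 0, 1, 1, 1]

rate = agreement_rate(human_labels, auto_labels)
kappa = cohens_kappa(human_labels, auto_labels)
print(f"agreement={rate:.2f}, kappa={kappa:.2f}")
```

A high raw agreement with a low kappa would suggest the automated evaluator mostly agrees by chance, for example by labeling nearly everything as good.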
The idea that prompt tuning or prompt engineering is a science: actually, there are multiple papers about this. This is not science, unfortunately.
The best approach for this, and this is a project we carried out in collaboration with the Technical University of Munich, is that we worked on an automated prompt tuner. Basically, what it does is take a prompt as a starting point. It mutates the prompt, runs it, checks the scores with our automated evaluations, picks the best one, mutates again, and tests again. This approach is a somewhat more systematic, methodical way to find the best prompt. It gives each prompt a score, which helps you find the best one.
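The mutate, score, select loop can be sketched as a simple hill climb. This is not the tuner built with the Technical University of Munich: the mutation pool, the `toy_score` function (a stand-in for running the automated evaluations), and all names are assumptions for illustration.

```python
import random
from typing import Callable

# Hypothetical pool of instruction snippets a mutation may append.
MUTATIONS = [
    "Answer concisely.",
    "Only use the provided information.",
    "Take a deep breath and work step by step.",
    "If unsure, say you do not know.",
]

def mutate(prompt: str, rng: random.Random) -> str:
    """Create a variant by appending one randomly chosen snippet."""
    return prompt + " " + rng.choice(MUTATIONS)

def toy_score(prompt: str) -> float:
    """Stand-in for automated evaluation: rewards distinct snippets, penalizes length."""
    return sum(snippet in prompt for snippet in MUTATIONS) - 0.001 * len(prompt)

def tune(seed_prompt: str, score: Callable[[str], float],
         rounds: int = 5, candidates_per_round: int = 4, seed: int = 0) -> tuple[str, float]:
    """Repeatedly mutate the current best prompt and keep a variant only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        candidates = [mutate(best, rng) for _ in range(candidates_per_round)]
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score

SEED_PROMPT = "Answer the customer's question."
best_prompt, best_score = tune(SEED_PROMPT, toy_score)
print(best_score, best_prompt)
```

In a real setup, `score` would run each candidate prompt against a test set and aggregate the automated evaluation results, which is far more expensive than this toy function.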
Why is it important? I think most of you heard about papers with very surprising results running similar approaches: "take a deep breath" was, for instance, the instruction which helped the most in one case. How on earth could we think about just playing around with prompt engineering to find "take a deep breath"? There are even wilder examples than that. These kinds of instructions need a little bit more methodical approach to find them.
The other big question here is: obviously we want to limit the exposure in the system, so how much people can jailbreak. I think all of us heard examples that made the news of different companies being hacked or jailbroken with really basic questions.
Those examples, to be honest, felt like they did not even do the basic prompt-instruction gating of the bots, agents, or FAQs. What we are talking about here is the additional layer of protection. We are using multiple layers. One, for example, is that Azure provides content filters both ways: we can filter what comes in and what comes out. The additional layers are third-party libraries we are utilizing on top of the basic techniques of restricting, by prompt engineering or prompt tuning, what comes in and out.
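The layering idea, with checks on what comes in and on what goes out, can be sketched generically as below. The checks are deliberately naive keyword tests and do not represent the Azure content filters or any third-party library; every name, phrase, and canned answer here is an assumption for illustration.

```python
from typing import Callable

Check = Callable[[str], bool]  # returns True when the text is acceptable

FALLBACK = "I am sorry, I cannot help with that. Let me connect you to an agent."

def no_prompt_injection(text: str) -> bool:
    """Naive input check: reject an obvious instruction-override attempt."""
    return "ignore previous instructions" not in text.lower()

def no_binding_commitment(text: str) -> bool:
    """Naive output check: reject answers that claim to be legally binding."""
    return "legally binding" not in text.lower()

INPUT_CHECKS: list[Check] = [no_prompt_injection]
OUTPUT_CHECKS: list[Check] = [no_binding_commitment]

def guarded_answer(question: str, llm: Callable[[str], str]) -> str:
    """Run input checks, call the model, then run output checks; fall back on any failure."""
    if not all(check(question) for check in INPUT_CHECKS):
        return FALLBACK
    answer = llm(question)
    if not all(check(answer) for check in OUTPUT_CHECKS):
        return FALLBACK
    return answer

def fake_llm(question: str) -> str:
    """Hypothetical model stand-in with canned answers."""
    if "offer" in question.lower():
        return "Sure, that is a legally binding offer."
    return "Your new tariff starts next month."

print(guarded_answer("When does my new tariff start?", fake_llm))
```

Because each layer is an independent function, a stronger filter can replace a naive one without touching the rest of the pipeline.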
This slide talks about a lot of topics which are important for enterprise readiness. It is one thing for us to use it casually, asking GPT about things, but for enterprise, obviously, there are many more risks. I think all of you have heard about examples like when a Chevrolet was sold for $1. They asked the LLM to make a legally binding offer to sell a Chevrolet for $1. Also, privacy in Europe, GDPR, and so on require additional focus on enterprise readiness.
As I mentioned, we are using Azure security capabilities. What is really important, and that is constant, is that we do not want user data or call data to appear in LLM training data, because it could actually surface in other uses. This is isolated. Whatever we use, whatever goes into the LLM, is never used to train a language model.
I am jumping topics a little bit. Another big topic is latency. The more advanced the model we are using, the slower it is. There is a big latency question. How we resolve latency has a lot of layers. What I am focusing on here is one layer: perceived latency.
It is one thing that the LLM is slow, but it is really bad if you are asking something and you have just silence. One way to improve perceived latency that we introduced is intermediate responses. These are fast and contextual intermediate responses. While the LLM is actually working out the proper answer, there is a quick contextual answer to acknowledge the question.
Also, we have some typing sounds and different sounds which are played while it is thinking. It is not just dumb silence. You know something is happening. This is the feedback: the system is working, it is not broken, the line is still there.
Surprisingly, ambient background sounds help a lot. Total silence makes people think that something is broken. If you hear some ambient background sounds, very, very subtle, that helps people understand what is happening.
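The pattern of acknowledging the caller immediately and playing filler sounds while the model works can be sketched with `asyncio`. This is a simplified illustration under assumptions: `play` stands in for the audio output, `fake_llm` for the slow model call, and the acknowledgement text, filler marker, and timings are made up.

```python
import asyncio
from typing import Awaitable, Callable

ACKNOWLEDGEMENT = "Let me check that for you."  # hypothetical wording
FILLER = "[typing sound]"                       # hypothetical filler marker

async def answer_with_feedback(
    question: str,
    slow_llm: Callable[[str], Awaitable[str]],
    play: Callable[[str], Awaitable[None]],
    filler_interval: float = 0.05,
) -> str:
    """Start the slow model call, acknowledge at once, and fill silence until it finishes."""
    task = asyncio.ensure_future(slow_llm(question))
    await play(ACKNOWLEDGEMENT)
    while not task.done():
        # Wait briefly for the answer; if it is not ready, play a filler sound.
        done, _ = await asyncio.wait({task}, timeout=filler_interval)
        if not done:
            await play(FILLER)
    answer = task.result()
    await play(answer)
    return answer

def run_demo() -> list[str]:
    """Run the flow with a fake 0.2-second model call and record what was played."""
    played: list[str] = []

    async def fake_llm(question: str) -> str:
        await asyncio.sleep(0.2)
        return "Yes, you can bring your dog on your flight, but you will be charged $75."

    async def play(text: str) -> None:
        played.append(text)

    asyncio.run(answer_with_feedback("Can I bring my dog to my flight to New York?", fake_llm, play))
    return played

events = run_demo()
print(events)
```

The model call and the audio feedback run concurrently, so the filler adds no delay to the final answer.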
If we want to summarize the lessons we learned, there are a lot. There is no single silver bullet, no "you have to do this or that." Usually, if you ask what made a very good solution a success, it is a lot of small things. The whole team worked together a lot and tried to find the best solutions.
First of all, I think it is proven that using LLMs with retrieval-augmented generation provides huge value for customers.
Also, it is really important to continuously monitor, evaluate, and improve the system, because the questions are changing, and you learn from production questions and so on.
I mentioned already the hybrid approach of evaluations and how difficult they are. You need a combination of human and automated evaluations. Latency: I could talk one more hour, and I can see Gene is probably going to interrupt, but I have time, right?
Gene Kim: You have two minutes. Take it all the way.
Peter Petrovics: Fantastic. So then I do not have to rush so much. Basically, hybrid evaluation is super important. Maybe if you ask, one of the most challenging things is to create really good automated evaluations for this. If you think about how different the questions can be and how the LLM can actually give different answers, even with a very subtle difference in the question, a good automated evaluation is probably the most challenging thing. We are still working to improve those automated evaluations. We still need human labeling and, I am pretty sure, will for quite a long time need human labeling and monitoring there.
Latency, as I mentioned, perceived latency is a factor. The whole audio engineering is a big topic: how can I reduce latency? There are a lot of promising things happening in the LLM ecosystem, like Groq, for instance, hardware support for faster invocations, and so on. There is hope that it will be less and less a problem, but currently it is still a problem.
Stefan Ostwald
Stefan Ostwald: I guess first of all, it was a massive success, right? Peter, thank you so much to you and the team. It was really impressive what you all built to make this success possible. Really amazing results.
If you are interested in really augmenting your contact center, of course, reach out. Also, if you think, "What an exciting topic, I want to dive into this," let us do this. Reach out. We have many great opportunities, and we are looking for great talent who are passionate about that topic and about really changing the world of conversational AI.
Host Outro (Gene Kim)
Gene Kim: Fantastic. Thank you so much, Stefan and Peter. By the way, I put some questions in the Slack channel about how much you had to experiment with organizing DevOps, MLOps, data scientists, etc. Congratulations on all your achievements. Thank you so much, and looking forward to the continuation of your journey.
Stefan Ostwald: Thank you. Looking forward to the chat later. See you.
Gene Kim: Fantastic. Thank you.