Goldman Sachs has been rapidly 
The bank’s Chief Information Officer Marco Argenti explained how he decides when and where to deploy advanced AI.
What went into the decision to roll this out to all employees now?
MARCO ARGENTI: There’s no quality AI without good quality data. You’re going to have AIs that give you the impression that you have good data, but then at the end of the day, it’s not going to sustain verification. So that’s point number one: really understanding when we’re feeling ready and confident with the data. The other one is really understanding what are the use cases that really add value, and also at which level and how they change. So over the past year, we’ve been piloting the GS AI assistant at all levels of Goldman Sachs – engineers, analysts, associates, vice presidents, managing directors, partners – and they gave us a lot of feedback, and then also they gave us a lot of questions and answers that we could benchmark to understand how to make the [assistant] better and better. And then obviously, there’s an element of safety. So when we were satisfied that this met all our very stringent requirements with regards to protecting all information, reducing hallucinations, and protecting from potential new vectors of attack – when all these things came together, we felt ready, we knew it added value.
As a result, the usage, not only the adoption, has gone up quite a bit, even from people that had it before. And so this month will be a record month for usage. We’re going to have over a million prompts written by our employees and we’re seeing the number of questions per employee going up constantly. That is really our guiding principle. We work backwards from our internal clients. We understand what they need, we understand what they want. When we feel ready to expand, we expand. And that’s exactly what we did.
So, a million prompts a month. Are you watching all those prompts and looking at who’s asking what, and is that going to become part of the performance review?
First of all, we have the obligation to be vigilant. And so there are controls that are looking at what type of information people put in the prompts, and then we apply filters, and we flag and so on and so forth. But that’s normal in a regulated industry. That’s what you do for every form of communication. Another thing that we look at is how models differ from one another. So, for example, right now, GS AI is not just based on a specific model. It actually gives you the choice of the latest and greatest model. So we have the latest GPTs. We have the latest Gemini. We have the latest Claude in various shapes and forms. And then the decision is, which model gives you the most accurate answer at the lowest possible cost? By doing automated tests on that, we can verify, OK, we will privilege this model over this other for this kind of user or this kind of question. So there are a lot of dials to tune.
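To illustrate the trade-off Argenti describes, here is a minimal sketch of choosing a model from automated test results. It is not Goldman Sachs’s system: the model labels, accuracy figures, costs and the `pick_model` helper are all hypothetical, and the rule shown (cheapest model that clears an accuracy bar) is only one plausible way to balance accuracy against cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str             # hypothetical model label, not a real product name
    accuracy: float        # fraction of benchmark questions answered correctly
    cost_per_query: float  # illustrative average cost in dollars


def pick_model(results, min_accuracy):
    """Return the cheapest result whose accuracy meets the bar, or None."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    # Cheapest first; ties are broken in favor of the more accurate model.
    return min(eligible, key=lambda r: (r.cost_per_query, -r.accuracy))


# Made-up numbers for one kind of question.
results = [
    EvalResult("reasoning-model", 0.97, 0.40),
    EvalResult("general-model", 0.91, 0.04),
    EvalResult("small-model", 0.82, 0.01),
]

choice = pick_model(results, min_accuracy=0.90)
print(choice.model if choice else "no model meets the bar")
```

Raising `min_accuracy` for research-style questions would push the choice toward the more expensive model, which is one of the “dials” mentioned above.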
Can you share an example or two of where ChatGPT is better, or where Claude is better, or where Gemini is better?
One distinction is models that have reasoning capabilities. They create the plan, they go out and they do, like agentic AI. And those generally tend to be more expensive, GPT 3 or 4, versus GPT 1, which is more of a Q & A model. Agentic models tend to produce 10 times the tokens of a non-reasoning model, so those are generally better suited for research type of questions: Here’s a list of companies, find all the comparables by looking at their filings, and rank them by sensitivity to a risk factor. A reasoning model can take the information, break it down, start looking out for data that represents potential comparables. And then they go and pick and fetch the filings. That is expensive, but then accuracy is a very important factor in this case, and the user doesn’t mind waiting 10 minutes for an answer.
But there are other types of questions, like ‘explain how to price a type of derivative,’ or ‘explain what this acronym means,’ or, ‘how does this product differ from this other product?’ In those cases, you can use a non-reasoning model very efficiently.
There are other cases where you need to produce content, such as a presentation, which is more the realm of certain generative models that have multimodal capabilities, so they can generate images, they can generate graphs, they can generate a lot of supporting material. So it really depends on use cases.
There is a whole cohort of people which is our developer community, where there are similar considerations, but a very different type of work that is produced. All the developers have GitHub Copilot. They are getting the Devin technology. There’s a shift from the developer assistant that helps you autocomplete your code or generate code for a specific task to “I now have a virtual colleague that I can delegate work to.” And that’s really the game changer there. This is going from a suggestion assistant to someone that you can say, “go out and do this work for me: migrate this code base from Java 8 to Java 21.” Or, “there is a new framework, I want you to look at all the dependencies, and where do I need to update it to the latest version?” This has a different cost structure. It’s not so much a sidekick or companion. You need to think about, I’m staffing up a project, I’m going to need X amount of humans, X amount of virtuals.
Some people have said that work that would have gone to offshore outsourcers is now going to go to agents like this for low-cost work.
That may happen. Right now it’s too early to say exactly what’s going to happen. Many companies are constrained by how many engineers they have, and so there is always a lot of work. Our opinion is that this is going to help us become more efficient and at the very least, do much more with the same [workforce].
One thing I hear is that these tools get better all the time, and you’re all fine-tuning your own versions of them, but accuracy is not at 100% with a lot of these models. Accuracy is still in the high 90s. How do you look at that when you are using gen AI to prepare answers for clients or to make decisions?
It’s a great question. We are used to thinking about computers as 100% precise, because it’s an evolution of the calculator that is precise by definition. But you need to benchmark [generative AI models] with humans. How do they compare with the accuracy of a human that is doing the same job? Humans are by no means 100% precise. In every single job, there is a failure rate that we measure, in every single task done by you. So then the question is, can an AI do the job? Well, yes, if it is provable that it is at least as precise as a human. So the real learning here is that you need to put AI through the same level of scrutiny and controls that we put on humans.
We have standards and we have controls. We have code reviews, we have automatic controls through the pipeline before it goes to production. So it’s almost indifferent if it is a human or if it is an AI, because you have to pass through the same bar before anything goes into production. The same is true for translation. When you do summarization of documents, how many people make translation mistakes? But then there are reviewers, and it’s the same here.
When you start to get into that frame, the real thinking is: you have a process, do you have a benchmark against which to compare the performance? Does the AI meet that benchmark? If not, then you work on data, you work on training, until it meets that benchmark, and then at that point, you can put it in a process for which you deploy this AI the same way as you employ a human.
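The benchmark test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the bank’s actual control: the `meets_benchmark` helper, the sample counts and the 4% human error rate are all hypothetical.

```python
def meets_benchmark(ai_errors, ai_total, human_error_rate):
    """True if the AI's observed error rate is no worse than the human baseline.

    ai_errors: number of wrong answers on the benchmark set (hypothetical data)
    ai_total: number of benchmark tasks attempted
    human_error_rate: measured failure rate of people doing the same task
    """
    if ai_total <= 0:
        raise ValueError("ai_total must be positive")
    return ai_errors / ai_total <= human_error_rate


# Made-up example: people doing this task are wrong 4% of the time.
HUMAN_ERROR_RATE = 0.04

if meets_benchmark(ai_errors=30, ai_total=1000, human_error_rate=HUMAN_ERROR_RATE):
    print("meets the bar: goes through the same reviews and controls as human work")
else:
    print("below the bar: keep working on data and training")
```

Passing this check does not replace review. In the framing above, AI output still goes through the same code reviews and pipeline controls as human output.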
You also see studies that show that people are getting dumber through the use of AI, they’re losing their ability to think critically by letting the AI do stuff. Creativity can also get lost.
I heard the same exact stories when the industry went from classic animation to computer generated animation. All of a sudden it was like, “oh my God, we’re losing all the nuances of people making these beautiful drawings.” And then all of a sudden you find productions done in 3-D with animation that would be impossible with pen and paper and pencil. Or in music, with the advent of digital music, the advent of synthesizers, the advent of Pro Tools, it was, “oh my God, we’re going to be so lazy. Now a computer can do all this for us.” And guess what? I have big faith in our ability to be shocked at the beginning, but then adapt and use our own nature to become creative.
One of the things that we really want to be very aware of is what’s happening out there, more than ever before, because of the speed of change. So one of the things that we really like focusing on is how do we send the feelers out there and understand what is the innovation that could translate to something helpful for us. And so we created this GS Innovation Center in 2022 to drive the velocity of innovation. And that’s where the initial conversations with Cognition Labs happened and where we have a framework where we create a sandbox where these tools can be tested safely before they go into the machinery of the bank. And I think for us, one of the learnings is to really look outside and understand the state of the art and the state of the possible, and then create a safe environment for innovation that really stays very close to and very engaged with those companies. It’s just unbelievable how AI has unleashed a new generation of companies and software providers that are doing incredibly creative and interesting things. AI is lowering the barrier to entry so much that startups are becoming very, very innovative, very quickly. We like to be in the center of that.
Are you seeing a return on any of this? In terms of hours saved or some other measure?
We measure, for example, development productivity very carefully. We definitely see that developers are, on average, 20% more productive, which translates into cost savings. We already are benefiting from that, and now we’re looking at, OK, agents are going to increase this number by how much. And then, as we start to put it in other operational roles, we are measuring how much that’s going to be. So we definitely are driving an ROI-based approach to how we invest in AI, and we’re already starting to see the first results for sure.
