How to go 100% cloud, fast: Inside S&P Global Ratings’ aggressive computing overhaul
Mark Wang, the company's head of cloud engineering, had a three-year plan to reboot its approach to computing. It was ambitious to say the least.
"We're a 160-year-old institution," says Mark Wang, head of cloud engineering at S&P Global Ratings. "Now, we're moving at the pace of a fintech."
Unlike many fellow financial services companies, which have been slow to adopt cloud computing, Wang's team at the global ratings agency has been nothing less than extremely aggressive in its rollout of new technology. Last year, it moved more than 160 of its internal applications to the cloud, bucking the trend of compromising on the hybrid cloud. This year, it's embarking on an ambitious plan to re-architect those applications around serverless computing principles and Kubernetes, using the Knative open-source project.
"Our cloud journey started in 2018," said Wang, underscoring the pace of his team's works. "We've built out all the cloud expertise in-house, that's one thing I'm very proud of."
Unlike fellow financial services stalwart Bloomberg, which has chosen to implement modern cloud technologies alongside older but stable internal infrastructure, S&P is in the middle of a wholesale renovation of its internal workings. The idea is to improve the speed at which S&P ships software, and to give developers the freedom to write code without having to think about where that code will wind up running.
In a recent interview with Protocol, Wang outlined S&P's progress to date, why it's thinking about hedging its bets with multiple cloud providers and how it has incorporated artificial intelligence and machine learning into its IT strategy.
This interview has been edited for clarity and brevity.
What is S&P's current cloud and infrastructure strategy?
From a strategy perspective, this year we have outlined our three-year vision on the technology side. And the vision is that year one is to remove the friction, is to become mature. And year two is really to start focusing on commercializing, which is to enable that speed to market.
Then the following year is to start to innovate. Once we have the speed to market, then we have that vehicle to ... generate new ideas and start to ship faster.
What does that mean? To remove the friction and to become more mature?
One is stability, that's really to fix our foundation. I'll give you an example: We moved to the cloud aggressively in 2019, we moved 160-plus apps to the cloud in nine months. That was more of a replatforming and rehosting, so that doesn't mean we've eliminated technology. So stability is mostly around that: How do we automate patching? How do we eliminate the toil? How do we make sure that the environment is stable? How do we make sure we eliminate a lot of the manual startup shutdowns, the vulnerabilities that we have, and so on?
Then there's [continuous integration] and [continuous delivery], and we added [continuous testing] as well. You probably know all the studies about how if you eliminate bugs in the development lifecycle, then you don't have to worry about them after release. Part of the stability initiative is also to reduce all the incidents overall by half.
And the other initiative, which is near and dear to my heart, is going serverless, [using] functions-as-a-service. So we took a giant leap from moving 100% to the cloud: We jumped directly to serverless. So serverless is going to become how we develop and we've partnered with Knative.
Can we back up just a little bit: You're 100% on the cloud? You've moved everything?
That's right. Our cloud journey started in 2018. We've built out all the cloud expertise in-house, that's one thing I'm very proud of. We didn't go out and hire a bunch of people or hire consultants. One of the learning objectives for the organization this year is we want teams, we want engineers, developers, QA, all [our] folks across the board to learn cloud and DevOps, to be certified and actually invest in cloud and DevOps.
In 2018 when we started our cloud journey, we moved our DMZ applications to the cloud, but given that we're a ratings agency and the majority of our applications are internal-facing, the DMZ was a small subset. So the majority of the 160 [apps], and all of the database, we moved all of that stuff to the cloud in a matter of about eight to nine months.
I know you're using AWS for some parts of your operation. Are you using it for everything?
Well, not technically everything. We also have a presence in Alibaba, and we're starting to use Azure as well. Amazon was our first target move. But our cloud strategy is multicloud, cloud agnostic, because of things like what happened to Azure [in September].
We always see that as a risk, if we've put all our eggs in one basket, given the criticality of a ratings agency. We want business continuity.
How do you balance investing in functions-as-a-service with multicloud? That seems difficult.
After we migrated to the cloud in 2019, we started thinking about our cloud strategy. That's when we decided to go FaaS. I researched the market along with my team, and I evaluated all the open-source FaaS technologies, like OpenWhisk, OpenFaaS. There are a lot of packages out there, but Knative was something that we decided on, and so far it has been very good to us.
So are you using Knative in production, then?
Yes, we are using Knative in production. By the end of Q1, we had a handful of applications going live. And the target is by the end of the year, we'll have all of our applications on Knative.
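For readers unfamiliar with Knative, the contract a workload has to meet is small: it is packaged as a container that serves HTTP on the port named in the PORT environment variable. The sketch below shows that shape in Python using only the standard library. It is an illustration, not S&P's code; the `handle` function, its `issuer` field and the `serve` helper are hypothetical names chosen for this example.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle(payload: dict) -> dict:
    """Hypothetical business logic: acknowledge the issuer named in the request."""
    return {"issuer": payload.get("issuer", "unknown"), "status": "received"}


class FunctionHandler(BaseHTTPRequestHandler):
    """Turns each POST body into a call to handle() and returns JSON."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length) if length else b""
        payload = json.loads(raw or b"{}")
        response = json.dumps(handle(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve() -> None:
    """Container entry point: listen on the port the platform provides."""
    # Knative injects PORT into the container; 8080 is used here as a local fallback.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), FunctionHandler).serve_forever()
```

Calling `serve()` as the container's entry point would start the listener. Keeping the logic in a plain function like `handle` also means it can be unit-tested without any cluster.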
I've been hearing a lot of talk about people kind of backing away from functions-as-a-service, and looking at serverless as more of a management layer that abstracts away infrastructure, not necessarily betting as much on functions themselves. It's a little hard to know what people are really doing and what people are talking about doing. But it's interesting to hear that you're using this kind of event-driven architecture, but then you have this Kubernetes layer; for a lot of the last couple of years, people have talked about those as almost in competition. And you're finding a way to do both.
I think the benefit of something like Knative over [AWS's] Lambda, for example, is that we have a lot more control over what goes into Kubernetes. If we were just to use that base functionality of FaaS, they were very limiting.
We have a diverse application portfolio, and we have legacy application support. So we have now Python, Java, we even deploy our own frontend into Knative. Our Angular applications go into Knative as well.
So as part of the Strangler pattern [for] breaking down the monolith, we have app rationalization in this demolition as well. What we're doing there is we're looking at, what are the capabilities and what are the common capabilities? We want to make these common capabilities available as functions.
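The Strangler pattern Wang mentions is usually implemented as a routing layer in front of the monolith: requests for capabilities that have been extracted go to the new function, and everything else still falls through to the legacy application. The following is a minimal sketch of that idea under assumed names (`StranglerRouter`, `legacy_monolith`, `issuer_lookup_function` and the `/issuers` path are all hypothetical); it does not describe how S&P routes traffic.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


def legacy_monolith(request: dict) -> dict:
    """Stand-in for the existing application that still serves most paths."""
    return {"handled_by": "monolith", "path": request["path"]}


def issuer_lookup_function(request: dict) -> dict:
    """Stand-in for a common capability that has been extracted as a function."""
    return {"handled_by": "function", "path": request["path"]}


class StranglerRouter:
    """Sends migrated path prefixes to new handlers and the rest to the fallback."""

    def __init__(self, fallback: Handler) -> None:
        self._fallback = fallback
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Register a path prefix as migrated to a new handler."""
        self._migrated[prefix] = handler

    def route(self, request: dict) -> dict:
        # Check the longest prefixes first so the most specific migration wins.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if request["path"].startswith(prefix):
                return self._migrated[prefix](request)
        return self._fallback(request)


router = StranglerRouter(fallback=legacy_monolith)
router.migrate("/issuers", issuer_lookup_function)
```

With this setup, `router.route({"path": "/issuers/42"})` is served by the extracted function, while a path such as `/reports/1` still reaches the monolith. Each later migration is one more `migrate` call, until the fallback is no longer used.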
You mentioned databases. What did you decide to do there?
We moved 100% to the cloud, [but] we were actually on Oracle Exadata. And we use a technology called FlashGrid to replicate the performance in the cloud. So now our Oracle is running on [AWS's] EC2 FlashGrid. So we moved our data into the cloud.
We also have a data strategy to further break down our data to purpose-built data stores. This way, we have the full CI/CD experience before microservices.
Why are we doing all this? It's really for the speed to market. Our development teams have their independent ownership of the stack. This way they can ship their product, and test. They don't have to depend on anyone. They have full control of application data, the whole nine yards.
OK, that's what's happening now. Where are you headed?
Year two is really to enable the data, to have that agility with the data front [of what we do] as well, and then slowly introducing AI and machine learning.
What do you want to get out of AI and machine learning?
To be honest, it really is a cultural shift. We want our teams to know AI and ML just like the way they learn functions, servers, Kubernetes, Knative, etc.
The initial step really is around efficiency. How does AI or ML solve your day-to-day problems? How does it remove friction? How does it make, for example, you think about monitoring? We get bombarded sometimes with monitoring [alerts], how do you find the insight in that? And then how can we use AI and ML to help us with test automation to improve quality?
It's all outcomes, but the initial step is to learn and to start introducing those capabilities into our day-to-day work.
Over the last couple of years, it seems that financial services companies have been slower to embrace some of these technologies, but it feels like that has changed a lot recently. I was wondering if you could point me towards something that changed; was there a technology where financial companies felt a little bit safer moving to the cloud? Was it a cultural thing?
Six, seven years ago, I was talking about cloud, and at that time we had a CTO that was retiring. He didn't want anything to do with it. And then ever since that time, it's always been cloud.
I think that the breaking point is probably when you see other companies doing it, when you see people like Capital One talking about it, open sourcing their code. It's one of those things ... it's a competitive thing; once you see others doing it, you want to do it too, and you want to do the best you can.
From our perspective, it has really benefited us, especially around COVID, because we are now [using] AWS WorkSpaces. So your laptop is broken: In a day, you have a workspace spun up with all the software that you need, and you're able to get your work done. It gives us a lot of stability, resiliency and it really allows us to accelerate very quickly.
When I looked at our company, we're a 160-year-old institution, and now we're moving at the pace of a fintech. We're doing these things with open-source software like Knative. If you want to have that competitive advantage, you really have to do cloud, you really have to do AI.
And we didn't even talk about blockchain. Blockchain is a huge thing for us as well.
I'm ... quite a blockchain skeptic for anything else other than currency or crypto applications. What specifically are you using blockchain for?
We're using blockchain to secure our most critical asset. If you think about the immutability aspect, because it provides that transparency as well as auditability, it makes sure that your data is in an immutable state.
The other thing is emerging markets, where if you were to build something [new], why wouldn't you use blockchain? You have the ability to have that transparent interaction with your clients, as well as regulators. We see it as a critical component, but of course, security first. It's a security tool for us, and we're looking at other use cases, in terms of sharing data with our clients.
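The immutability and audit property described above comes from chaining records by hash: each record stores the hash of the one before it, so editing any earlier record breaks every later link. The sketch below illustrates only that mechanism. It is not S&P's system, it omits the distribution and consensus that a real blockchain adds, and the `Block`, `append` and `verify` names and the sample rating data are assumptions made for this example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


@dataclass(frozen=True)
class Block:
    index: int
    data: dict
    prev_hash: str
    hash: str


def compute_hash(index: int, data: dict, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block's contents."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: List[Block], data: dict) -> None:
    """Add a record that commits to the hash of the previous record."""
    prev_hash = chain[-1].hash if chain else GENESIS_HASH
    index = len(chain)
    chain.append(Block(index, data, prev_hash, compute_hash(index, data, prev_hash)))


def verify(chain: List[Block]) -> bool:
    """Recompute every hash and link; any edited record makes this fail."""
    prev_hash = GENESIS_HASH
    for position, block in enumerate(chain):
        if block.index != position or block.prev_hash != prev_hash:
            return False
        if block.hash != compute_hash(block.index, block.data, block.prev_hash):
            return False
        prev_hash = block.hash
    return True
```

After appending a few records, `verify(chain)` returns `True`; changing a value inside any stored record's `data` afterwards makes it return `False`, which is what gives an auditor a cheap tamper check.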
And you're using that in production now?
We have been using it in production for almost two years.