TSB’s IT Fiasco and Some Implications for Banks’ Legacy Systems

Posted on April 26, 2018 by Yves Smith

We’re turning again to TSB’s botched effort to transfer all of its users to a new computer system over the past weekend, where the bank is still bleeding from a major artery.

The TSB Train Wreck Continues

In yet another effort at porcine maquillage, the CEO today tried claiming everything was fine.

Our mobile banking app and online banking are now up and running. Thank you for your patience and for bearing with us.

— Paul Pester (@PaulPester) April 25, 2018

That didn’t last long. Again from Paul Pester:

The challenge we are facing at the moment is that while we know everything is working, one of the main ways that our customers see everything is working – through our internet banking and mobile app – isn’t functioning as well as it should be, and for this I’m truly sorry. I can appreciate how frustrating this must be for our customers.

As Financial Times reader Paper Chase remarked: “The level of contradiction in this statement is beyond my comprehension.”

The bank later stated that only about half the online customers could access their accounts, as if the only problem was now capacity. Per the Guardian:

TSB said internet banking was operating at 50% of capacity, which means that for every 10 customers only five will be able to access this service. Mobile banking was operating at around 90% of capacity, the bank said in a statement issued at about 4pm.

It appears the bank has a different idea of what “operating” means than customers do. Twitter was rife with complaints from users that while they could finally get into their accounts, they saw alarming errors and/or had difficulty performing transactions. These tweets are from the late afternoon and early evening UK time:

Just entered log in details to get this message #tsb @TSB this is beyond a joke now! When I eventually got in it now won’t let me make payments #TSBFAIL pic.twitter.com/M7Kp966T4H

— Cheryl Ewing (@shallotohgravy) April 25, 2018

#tsbfail ERROR_SISTEMA showing at 18:01 so still cannot pay my Utility Bills! I hope the Utility Companies appreciate this is TSB and not me !!

— Mary George (@marygeo6) April 25, 2018

@PaulPester @TSB TSB are complete liars! I’ve been on hold trying to get though for 45-50 mins now and keep getting cut off! Don’t tell you have a problem with your phone lines now? I still can’t make a transfer on my mobile banking app! Sick to death of it now! #TSB #TSBFAIL

— Anne Morton (@amorton66) April 25, 2018

Oh @tsb, to say these problems are down to ‘bandwidth’ really sounds dubious. And the lack of incident management on this is almost unbelievable. As someone who works in IT I’m horrified at how this has been handled! #tsbfail

— julie ballantyne (@julieballantyne) April 25, 2018

It's getting worse as the day goes on. I can't login online at all now!! I'm locked out of the app as its trying to text me on a number I changed 3 years ago. To say the system is running is complete bollox @PaulPester #tsb #tsbfail pic.twitter.com/6AlLQlOR6d

— Jobseeker (@Anonymous_Nottz) April 25, 2018

Needless to say, the potential damage goes beyond late bill payments and inability to access funds, which bites for stay-at-homes who can’t go to a bank branch and people with large expenses in coming days, such as for a wedding. CEO Pester also claimed that TSB’s “engine room” was working fine and that scheduled payments and transfers were all being processed.

Given that the top management of TSB was high-fiving a successful launch over the weekend, its ability to make accurate self-assessments seems pretty impaired. As Richard Smith and some parties on Twitter noted, there are signs of data mapping problems and potentially even data corruption, such as some customers getting logged into other individuals’ accounts and quite a few reports of mortgages disappearing or having incorrect information. These complaints in the Guardian also contradict Pester’s cheery reassurance that transactions are being processed correctly despite appearances otherwise:

Small businesses were unable to pay salaries or manage transactions, while some account holders found all their direct debits had disappeared and others reported that their cards were declined when shopping.

And Clive pointed out earlier in the week that Pester’s own remarks acknowledged that some, Lord only knows how much, customer data has been lost:

But the bigger picture which is obscured by the “helping us to help you” advice from the CEO is — excuse me — WTF is a bank which cannot maintain an accurate ledger or record of customer product holdings? And Pester’s statement is tantamount to an admission that, for some correcting entries, they are going to have to rely on a customer telling them what they (TSB) need to go away and put right rather than the bank being able to reconstruct a valid account history, apply misposted transactions or reunite orphaned product holdings with their rightful owners.

I for one have a rough idea about what should or should not have been posted to my account — but I don’t keep a comprehensive shadow set of records. If TSB has so broken its CRM and accounting data that it needs customers to tell it what’s missing then customers will have lost money for sure — or else have credits on their accounts or be assigned ownership of products which aren’t really theirs which they might inadvertently draw on only to find out much later down the line they have to reimburse the bank because they weren’t entitled to the funds.

Richard Smith flagged this, one of several such cases, as a “spectacular example of a data mapping problem”:

#TSB I can't get through to customer services, been cut off after 15 mins on hold. My user ID had been swapped with my memorable info and my password isn't recognised so tried to change it and told new one can't match the old? It's been 5 days & I can't eat/travel I need my money

— sarah clark (@sarahcl42953252) April 25, 2018

I would assume that data mapping problems don’t often occur in isolation.
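
A swap like the one in that tweet, where a user ID ends up in the memorable-information field, is the kind of thing a post-migration reconciliation pass can catch. Here is a minimal sketch in Python, assuming hypothetical field names (`customer_ref`, `user_id`, `memorable_info`) and made-up records; nothing in it reflects TSB’s actual schema or tooling:

```python
def find_mapping_errors(source_rows, migrated_rows, key="customer_ref"):
    """Compare each migrated record with its source record, field by field.

    Returns a list of (key value, field, source value, migrated value)
    tuples, one for every field that does not match.
    """
    migrated_by_key = {row[key]: row for row in migrated_rows}
    errors = []
    for src in source_rows:
        dst = migrated_by_key.get(src[key])
        if dst is None:
            # The record vanished during migration (an orphaned holding).
            errors.append((src[key], "<missing record>", None, None))
            continue
        for field, value in src.items():
            if dst.get(field) != value:
                errors.append((src[key], field, value, dst.get(field)))
    return errors


# Hypothetical data: the second migrated record has two fields swapped.
source = [
    {"customer_ref": 1, "user_id": "alice01", "memorable_info": "tabby"},
    {"customer_ref": 2, "user_id": "bob22", "memorable_info": "harbour"},
]
migrated = [
    {"customer_ref": 1, "user_id": "alice01", "memorable_info": "tabby"},
    {"customer_ref": 2, "user_id": "harbour", "memorable_info": "bob22"},
]

mismatches = find_mapping_errors(source, migrated)
```

A check of this kind only works while the source system is still available to compare against, which is one argument for keeping the old platform readable for some time after cutover.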

So TSB has data integrity problems. On top of that, you have the potential for hackers to make off with funds or do damage to records while trying. From the Financial Times:

In its Sunderland call centre, stressed staff were offered free fruit to keep their spirits and vitamin C levels up as they dealt with thousands of angry customers. Two people with knowledge of the situation said workers lacked confidence in the security checks intended to weed out fraudulent attempts to take advantage of the confusion. Several staff walked out.

Let’s step back a bit and see what this says about the state of banking.

How Did This Screw-Up Happen?

On the one hand, the sorry story of TSB’s cack-handed effort to move to a new system is likely to show that Banco Sabadell, which bought TSB in 2015, is particularly incompetent in managing bank systems. On the other hand, as we’ve discussed before, big IT projects regularly fail, with the admitted abort rate at about 50% and the actual probably closer to 80%. Yet the way banks have been running their systems for decades, without adequate documentation or investment, seems virtually guaranteed to set them up for huge train wrecks.

Signs that Banco Sabadell is particularly clueless even by the low standards of bank IT managers. Since regulatory and Parliamentary proctology is virtually guaranteed, we’ll know more about TSB/Banco Sabadell’s pathologies in due course. Nevertheless, some of the statements of bank executives, combined with examples of error messages, demonstrate that the Banco Sabadell top brass is particularly clueless. Aside from the repeated premature declarations that everything was fine or nearly fine when it wasn’t, consider:

This acquisition should never have been made. In the US, we’ve had way more bank mergers than anywhere in the world by going from a highly fragmented banking system (over 16,000 banks in the late 1980s, and even that number reflected some mergers during that decade) to about 6800 in early 2014. That much trial and error means financial services industry executives have learned what not to do. One lesson is to pass on a bank acquisition when the systems integration looks hairy.

TSB was a mid-sized bank deal, with £42 billion in assets at year end 2017, or about $60 billion, and roughly 550 branches. By way of comparison, Canadian bank TD ate my somewhat larger-than-TSB-bank, Commerce Bank, in 2008. The only IT hiccup I’ve seen is that they too often make a mess of wire transfers.

There are some red flags that Sabadell and TSB systems had major compatibility issues. The first was that TSB’s former parent Lloyds had entered into the unusual arrangement of running the TSB systems for Sabadell for years after the 2015 sale. Second was Sabadell’s bold pronouncement that it was going to move TSB customers to a “brand new core banking system”.

The timetable was insane. The Financial Times published this astonishing (or maybe not in light of what happened) tidbit:

A person briefed on the TSB board’s plans insisted the platform had not been rushed out, and said questions over who was responsible would have to wait until the system was running smoothly for all its customers: “Given the testing they didn’t think this was where they were going to end up — they never would have pushed the button otherwise,” the person said.

Huh? It’s not exactly clear when Banco Sabadell took operational control of TSB. Even though Sabadell acquired all TSB shares on August 21, 2015, an August 29, 2015 Guardian story describes Lloyds having to divest TSB branches on behalf of Sabadell, with the headline “TSB to shut 17 branches before takeover by Spanish banking group Sabadell”, and says in the body that the deal was yet to be completed. Sabadell would not be asking Lloyds to handle divestitures if closing were in days or even weeks. So our assumption that Sabadell took operational control at the end of September 2015 would be generous.

That would mean two years, seven months for this “brand new core banking system” to be built. Contrast this with vlade’s comment on project timelines:

Just a version migration of the same system was easily a year/two programme. A new system introduction to replace a legacy system was 3-5 years easily..

I was also involved in post-sale migration at the time, and it was also a 18months + programme, even if it was rather simple (as in maybe a few thousands clients, and few tens of thousands transactions were being moved).

I’ve also seen a replacement of the retail system done (but wasn’t directly involved), and again, it was a pretty few years worth. Customer base would have been about the same size as TSB.

It looks as if Sabadell was sold more than a bit of consultant hopium. From Financial Times reader Stephen:

Apparently Banco Sabadell’s Proteo is based on Accenture’s alnova banking package, which is an old COBOL system modified to run on Unix under .NET. Sabadell’s new Proteo4UK version of it runs in Amazon cloud.

Amusingly, Sabadell released a self-congratulatory press release on Sunday afternoon, now preserved for posterity in Google’s cache:

Banco Sabadell successfully completes TSB technology migration…

For TSB, the Proteo4UK migration project is the best strategic decision it could have made when compared to the technological alternatives considered at the time, as the platform fulfils its business requirements and supports its strategy. Total synergies arising from the migration are estimated to amount to £160M on an annual basis.

The new Proteo4UK platform offers increased operational simplicity, which, for example, enables significant time savings in the implementation of the principal business and operational processes. The new platform also generates new opportunities, such as the launch of the SME business and enhanced user experience thanks to the digitalisation and homogenous omni-channel deployment of products and services.

Testing was grossly inadequate to non-existent. Richard Smith, without looking hard, found ready evidence, such as BeanCreationException. More generally, to IT pros, the fact that Twitter is full of screenshots of customers actually seeing gnarly error messages is a gasp-out-loud programming failure. As our Otis B Driftwood said yesterday:

Those TSB messages violate a cardinal rule of software development to NOT expose unexpected internal error information to an end user. They should never have passed design and code review and QA testing.
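
The rule Otis describes is usually enforced at the outermost layer of an application: log the internal details where engineers can find them, and hand the customer a neutral message plus a reference code. Here is a minimal Python sketch, assuming a hypothetical `handle_request` wrapper and a made-up failing handler; the message wording and the bean name are illustrative only:

```python
import logging
import uuid

logger = logging.getLogger("banking_app")


def handle_request(handler, *args):
    """Run a request handler without leaking internal errors to the user."""
    try:
        return {"ok": True, "result": handler(*args)}
    except Exception:
        # The full stack trace goes to the internal log only.
        incident_id = uuid.uuid4().hex[:8]
        logger.exception("Unhandled error, incident %s", incident_id)
        return {
            "ok": False,
            "message": "Sorry, something went wrong. Please try again later.",
            "incident_id": incident_id,
        }


def show_balance(account):
    # Hypothetical handler that fails with an internal framework error.
    raise RuntimeError("BeanCreationException: could not create bean 'accountService'")


response = handle_request(show_balance, "12345678")
```

The incident ID lets support staff match a customer complaint to the logged stack trace without the customer ever seeing it.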

Reader visitor added:

The errors revealed through these messages should never have passed unit testing in the first place. Accessing an array out of its bounds? Dereferencing a null pointer? Attempting to use a connection that is not set up?

All those circumstances indicate that the programmers do not initialize their variables properly, or are using them in ways that get them overwritten / deallocated between procedure calls — and in addition do not even attempt to check for abnormal situations.

Or perhaps they are relying upon a third-party application framework that is riddled with those problems.
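
The three failure modes visitor lists are cheap to guard against and cheap to test for. Here is a minimal sketch in Python, assuming a hypothetical `latest_transaction` helper and `Connection` class invented for illustration:

```python
class Connection:
    """Hypothetical stand-in for a database or service connection."""

    def __init__(self):
        self.is_open = False

    def open(self):
        self.is_open = True


def latest_transaction(transactions, connection):
    """Return the most recent transaction, or None if there are none."""
    if connection is None or not connection.is_open:
        # Guards against using a connection that was never set up.
        raise ConnectionError("connection is not open")
    if not transactions:
        # Covers both None (a null reference) and an empty list,
        # where indexing [-1] would be out of bounds.
        return None
    return transactions[-1]


def run_checks():
    """Exercise each abnormal situation once, as a unit test would."""
    conn = Connection()
    try:
        latest_transaction([10, 20], conn)
        unopened_rejected = False
    except ConnectionError:
        unopened_rejected = True
    conn.open()
    return {
        "unopened_rejected": unopened_rejected,
        "none_handled": latest_transaction(None, conn) is None,
        "empty_handled": latest_transaction([], conn) is None,
        "normal_case": latest_transaction([10, 20], conn) == 20,
    }


results = run_checks()
```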

What Does This Clusterfuck Say About Bank IT?

Given the discussion above, we are likely to learn officially that Sabadell did every bit as lousy a job of managing this migration process as we can infer from our remove. But our hypothesis is that this would have been beyond the ability of even a much more technologically competent bank to execute.

Several readers pointed to the fact that RBS allegedly spent over five years and £1.5 billion in a failed effort to clone its systems so it could divest roughly 300 Williams & Glyn branches. That should have been easier than what Sabadell was trying to execute with TSB. As Financial Times reader SpeedFerret noted in early 2017:

Who in their right mind would buy a stand up bank of just 300 branches with a sprawling technology stack of over 600 (yes 600) ancient applications to support it. A very poor decision to come up with this strategy in the first place. £1.5bn thrown to Indian IT offshore consultancies to untangle the spagetti of systems and overnight batch jobs came to nothing. A further £750m to be handed out as dowries. All we need now is the ATM network to be out of action for a week over the Easter holidays.

That gives a tiny vignette into the real problem, which we hope to chip away at in later posts, which is that bank systems are a greater mess than pretty much anyone on the outside imagines and are kept running with liberal applications of duct tape and baling wire.

The very short version of “How could banks have possibly gotten themselves in this mess?” is that IT has never been treated as mission critical when it is. Banks have underspent in the systems area for decades. The biggest symptom is lack of documentation. It costs money and adds to project completion time (although it reduces life cycle costs, but who cares about the long term?). The second is rushed projects with bad code in them that only kinda-sorta works. With systems in production, it’s almost always perceived to be too costly to identify, rip out, and redo the botched parts. Instead, the IT pros resort to patches and work-arounds. That adds to complexity and fragility, particularly when the system gets stressed or large changes are made.

The problem of systems management might be a smidge worse in the UK than the US if the erosion in the managerial capabilities of what passes for the elite that we see in its political classes has occurred in paler form in the private sector. Clive has spoken of a style of managerialism that holds specialist knowledge in contempt and views people that try to point out why certain goals are unrealistic as not clever enough. Mind you, we have that in the US thanks to management by MBA, but the rot may be less advanced, since most MBAs recognize that they are exploiting underlings with expertise and posture that their skills really don’t matter much, while their British counterparts may believe that on a much more visceral level.

40 comments

The Rev Kev April 26, 2018 at 5:51 am

Oh my! After reading through this whole article, profanity seems too weak here. I may be too hasty but perhaps it will be necessary for a consortium of banks or even the government to step in and take over the lot lest it damage the credibility of the UK banking system – if such a thing is possible. This sounds like the Obamacare roll-out debacle.

Hey, I have a thought. Would not the British spooks have a complete record of all transactions on their servers? They spy on everything else apparently. Even if they could only provide a snapshot of the system before the new system came online, it might be some sort of starting point. Either them or the NSA.
1. Oregoncharles April 26, 2018 at 2:06 pm
  
  It’s considerably WORSE than the Obamacare rollout, because it’s an operating system that they can’t just put on hold until it’s ready.
  
  At least in the US, it sounds like they would have already created enough legal liabilities to bankrupt the bank. It would make an awesome class-action suit.
notabanker April 26, 2018 at 6:17 am

If this is truly cloud based and that has anything to do with the problems it will be a massive blow to that platform in the UK.

There is a lot of internal opposition to cloud anything inside the banks and the UK regulators were most supportive. The big techs then do what they do so well and lobby board and senior execs. This could set their efforts back 3-5 years.
1. mosschops April 26, 2018 at 10:15 am
  
  That’s not true, most of the big UK banks are pretty keen on the cloud due to the cost savings, the thing preventing large scale adoption is the lack of regulatory sign off at the moment.
2. TheMog April 26, 2018 at 11:20 am
  
  It’s hard to tell from the outside if the problems would be connected to running in the cloud (aka “someone else’s computer”) compared to running in your own data center. I suspect that even if the problem would be connected to cloud usage (like the front end not being able to deal with the backend server going away and having to switch to a different server, which can also happen in your own datacentre), they likely would have manifested themselves even if the system hadn’t been moved to the cloud.
  
  The fact that they ran into issues with users logging in and ending up seeing someone else’s account suggest bigger issues.
  1. Oregoncharles April 26, 2018 at 2:08 pm
    
    And who thinks that something called “the cloud” might be secure, especially with billions of dollars at stake?
vlade April 26, 2018 at 7:13 am

Cobol interpreted by .NET running on Unix????? Implemented by Accenture?

The last point alone is at least semi-guaranteed fail. Combine with the former, it’s 99% fail IMO.

I’d like to know whether it’s still using good ole Cobol structured files, or at least attempts to use a database – the former could easily explain a number of the data issues. It’s hard to migrate even between two SQL databases (mapping is usually non-trivial, especially if any of the system was “customised” in the meanwhile, which is common), but to do it between a Cobol file structure and SQL is hell.
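
To see why that mapping is painful: a COBOL-style flat file typically stores each record as a fixed-width string whose layout lives in a separate definition, so the migration code has to slice every field by position before anything can go into a SQL table. Here is a minimal Python sketch, assuming a made-up three-field layout; real layouts are far larger, and packed-decimal fields are not handled here:

```python
import sqlite3

# Hypothetical layout: (field name, start offset, length).
LAYOUT = [
    ("account_no", 0, 8),
    ("holder_name", 8, 20),
    ("balance_pence", 28, 10),
]


def parse_record(line):
    """Slice one fixed-width record into a dict of typed fields."""
    expected = sum(length for _, _, length in LAYOUT)
    if len(line) != expected:
        raise ValueError(f"expected {expected} characters, got {len(line)}")
    fields = {name: line[start:start + length] for name, start, length in LAYOUT}
    return {
        "account_no": fields["account_no"],
        "holder_name": fields["holder_name"].rstrip(),
        "balance_pence": int(fields["balance_pence"]),
    }


record = parse_record("12345678" + "J SMITH".ljust(20) + "0000012345")

# Load the parsed record into an in-memory SQL table.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts (account_no TEXT PRIMARY KEY, "
    "holder_name TEXT, balance_pence INTEGER)"
)
db.execute(
    "INSERT INTO accounts VALUES (:account_no, :holder_name, :balance_pence)",
    record,
)
row = db.execute("SELECT * FROM accounts").fetchone()
```

If one offset in the layout is wrong, every later field shifts with it, and the result may still look like plausible data. That is how values can end up in the wrong column without any error being raised.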

What the most successful migrations I saw did was migrate one thing at a time.

So for example, TSB could identify small product segment, I don’t know, personal loans (nice standardised product, relatively predictable cashflows, not that hard to roll back, etc. etc.), not much that you need on IB (if anything) – or you run a separate IB for that.

You migrate those – and yes, you’ll run into problems, but it’s a small sample of your clients, there are few payments issues (from you to the client), and if a payment from the client to you gets lost, you accept it as a cost of the migration. Once you bedded that down well (which means at least 6-12 months from going live), you look at what other products you can migrate safely (maybe saving accounts and then mortgages). You’d migrate current accounts last, as that’s the trickiest bit, and one that can cause most damage if it goes wrong.

But all that takes time, and is much more expensive than big-bang approach (if BB works, of course). TBH, it also runs the significant risk that after product A you find out the new system doesn’t work for you, and you end up even more fragmented than before.

Pilot across everything is a nice idea in theory, but in practice IMO it doesn’t work, as a small population sample on a number of products is worse than a full population on small number of products.
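
The product-by-product sequence described in this comment can be laid out as a simple schedule. Here is a minimal Python sketch, assuming the product order suggested above and the six-month lower bound given for bedding down each step; the start date is hypothetical:

```python
from datetime import date

# Order follows the comment: simplest product first, current accounts last.
PRODUCTS = ["personal loans", "savings accounts", "mortgages", "current accounts"]
BED_DOWN_MONTHS = 6  # lower bound of the 6-12 months mentioned above


def add_months(day, months):
    """Return the first day of the month that is `months` after `day`."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def phased_schedule(start, products=PRODUCTS, bed_down=BED_DOWN_MONTHS):
    """Assign each product a go-live date, one bed-down period apart."""
    schedule = []
    go_live = date(start.year, start.month, 1)
    for product in products:
        schedule.append((product, go_live))
        go_live = add_months(go_live, bed_down)
    return schedule


# Hypothetical start date, chosen only for illustration.
schedule = phased_schedule(date(2015, 10, 1))
```

Even with the shortest bed-down period, the last product goes live 18 months after the first, before allowing any time for bedding it down or for building the system in the first place.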
1. Jesper April 26, 2018 at 7:59 am
  
  Yep, Accenture and banks:
  http://www.thepropertypin.com/viewtopic.php?f=4&t=66785&hilit=accenture&sid=bb10eaab39b58d5d5994e61d3e7445fe
  Anyone able to see a trend? Am kind of wondering if a particular name will pop out around TSB….
2. oliverks April 26, 2018 at 9:39 am
  
  I saw the COBOL on .NET under UNIX and also thought WTF. I don’t know much about bank IT, so perhaps this is normal. But if you brought that project to me, I would think something has gone pretty horribly wrong without even knowing what the project is.
  
  I am wondering if Bank IT is so messed up because they started using computers earlier than other industries. So this led to many odd and strange hardware / software combinations on the then emerging technology. As software lives for ever, it had to be ported and reported to the point where it is pretty wobbly by now.
  
  An obvious question is how much do banks spend on IT? If the cost per customer is higher than the acquisition cost of a customer, it might make sense to start a new bank. The new bank could use a clean sheet code base, which isn’t using COBOL on .NET running under UNIX.
  1. YankeeFrank April 26, 2018 at 10:16 am
    
    The one reason I’ve seen a bank run .net on unix is because their sysadmins were unix guys and didn’t want to have to host windows servers. There is a good port of the .net virtual machine to unix — the mono project, but as I recall (haven’t looked at it in years) it was always a few versions behind the windows version, whatever that’s worth. I stopped coding .net a long time ago.
    1. oliverks April 26, 2018 at 11:39 am
      
      .NET for Windows programming absolutely makes sense, but was someone really trying to develop a Windows program with COBOL?
      
      If you are not trying to develop a Windows program, why choose .NET (or mono) to run under UNIX? Did it move to Windows, and then to Linux?
      
      I know COBOL has evolved (I believe it even supports objects these days). But are people trying to write Windows apps with it?
      
      Something just seems to have gone wrong here. My mind is boggling. My wife on the other hand sees lots of billable hours ahead on this project.
      1. YankeeFrank April 26, 2018 at 3:52 pm
        
        I believe the cobol was the legacy code that handles the account ledgers, and it runs on unix for the TSB system. It wasn’t running on windows but it might could as some kind of console app, though why is anyone’s guess.
  2. flora April 26, 2018 at 10:48 am
    
    …the COBOL on .NET under UNIX …
    
    That caught my attention, too. First thought was “if they don’t understand what the COBOL program is actually doing (and it’s probably the deepest original layer of the legacy code) then porting it to another platform, even a Cloud platform (Cloud is not a magic black box), won’t solve the problems.” I have a hunch the TSB transition managers do not understand what the COBOL program is actually doing or how it syncs with all the other applications several layers out.
    
    One rule of thumb (in general): If you can’t see it you can’t fix it. Without being able to see how the system works in all its complexity you cannot see the problem clearly. You’re left with endless theorizing and solutions that look like kludges, if not something worse, efforts that look like whack-a-mole.
    
    This is no criticism of the individual programmers.
  3. lambert strether April 26, 2018 at 2:40 pm
    
    Me too, on the COBOL on .NET under Linux
    
    I hope they lured some greybeards in off the golf course to decrypt The Great Runes, but I’m guessing no.
    
    Pass the popcorn
rfdawn April 26, 2018 at 7:33 am

Swerving only slightly off-topic here, my unnamed browser slowly displays the TSB login page with this:
“Flash was blocked on this page.”
As it should be. What’s wrong here is that flash should not be there at all, because of things like this, this, and many more. Now, TSB’s own flash content may be flawless but installing or enabling flash on my platform is kind of risky and has been for a long while. Basic TSB web functions probably don’t need it although the “TSB online banking demo” clearly does. Just having this stuff onsite is an incitement to customers to enable flash and that’s a very bad idea.
1. YankeeFrank April 26, 2018 at 10:13 am
  
  I see these flash messages on many, many sites that don’t really need it. I’m not sure why it happens honestly, whether it’s the browser screwing up or some legacy code or whether the site is doing something nefarious like trying to use flash for secret cookie storage. Probably the latter.
funemployed April 26, 2018 at 7:37 am

I can’t help but sympathize with the great many drones whom I assume knew darned well that a clusterfamilyblog was inevitable, yet swallowed pride and burned some uncompensated midnight oil anyhow out of professional integrity and company loyalty.

Bad information flow is just wonderful for morale (assuming /sarc tag not necessary here), particularly when its source is managerial hubris.
1. Disturbed Voter April 26, 2018 at 12:42 pm
  
  GIGO is systemic, incompetence isn’t necessary, though it doesn’t help either.
  
  It has been known for decades how to do this kind of migration, but to save money …
  
  So yes, similar integration problem as Obamacare rollout … it is very hard to make this work …
  
  Large computer projects have a 50% total failure rate (that doesn’t mean late, over cost, under powered … but complete cancellation).
Thuto April 26, 2018 at 7:39 am

Wow, this problem is endemic. Upon reading this article I called a friend who’s working deep in the bowels of banking IT here in sunny SA. He’s due to deliver a major migration project end of July but internally the sober realization is that it ain’t happening, not by end of July, not by end of the year. The system that’s meant to be decommissioned supports 25 front end systems which in turn interface with other systems and switching it off prematurely could spell major disaster for the bank.

His assessment as a battle hardened banking IT guy: the finance guys running banks aren’t equipped to run the operations of what a modern bank has become, a data/IT/software company with banking products bolted on top. IT sits at the very core of a modern banking operation and as such banks should be run, or at least co-run by engineers with finance people focusing on designing customer facing products (not the current situation where IT is sneered at by the top brass and starved of the budget to secure the nuts and bolts of an operation that pays said top brass their fat bonuses). I can almost guarantee that the technical debt incurred by rushing half baked systems out the door and into production is always at the insistence of managerial “know it all” types holding up a Steve Jobs biography and educating everyone about the “reality distortion field”. As any software engineer knows, when, not if, technical debt bites, it bites hard and deep…
1. vlade April 26, 2018 at 8:08 am
  
  TBH, the problem is on both sides. I saw, not exactly rarely, IT people holding the business to ransom (“I don’t care what business wants/needs” is a direct quote), while playing with their toys, i.e. bringing in new technologies just to get it on the CV w/o an actual need. All of which makes the systems even more complex than they ever need to be. Fief creation etc. also works great in increasing the complexity.
  
  The problem is that before IT, the complexity was limited by what humans could reasonably understand themselves – there were few, if any businesses that could not be truly comprehended by a human.
  
  IT allows unimaginable additional complexity which is almost all hidden. That is the cost of its flexibility – that the flexibility can collapse on itself in a mass of flexible chaos.
  
  One of the reasons IMO why apple was so successful was not because it gave choices, but because it took away choices.
  1. YankeeFrank April 26, 2018 at 10:24 am
    
    I agree but remember, they started it!
    
    I’m only partly serious but the first big bank I worked at was full of “project managers” from the “business side” who treated the IT staff like fungible dirt. Get treated like that a few times and “what’s going to look good on my resume” or “that’s a nifty tech thingy, let’s try it” start to become habits. And remember, if the managerial staff was at all tech literate they would lock down and not allow unapproved tech into the ecosystem in the first place. So I’m going to go out on a limb and say the biggest problem in bank IT is exactly what Thuto’s friend said.
karen April 26, 2018 at 7:43 am

This is probably a naive question (I am not an IT person) but…at a certain point isn’t it easier (and therefore cheaper, certainly less risky) to build a new system from scratch, based on current best practices…then ask groups of users to migrate over in manageable batches, troubleshooting along the way?

It strikes me that endlessly innovating technologies for marginal improvement, without adequate compatibility requirements (move fast and break things) is creating a truly dysfunctional level of complexity in our society. As in societal-collapse level of complexity à la Joseph Tainter. I feel a public backlash coming, but as of now people mostly feel powerless, as these “upgrades” are normalized. For those who cannot pay their bills…what about a good old fashioned cheque?
1. vlade April 26, 2018 at 7:50 am
  
  Yes in theory. But the main assumption here is that you actually know what your old system does. Or, if you don’t, that any functionality lost is non-critical. Which is rarely the case.
2. notabanker April 26, 2018 at 7:51 am
  
  You mean like this?
  
  https://www.forbes.com/sites/tomgroenfeldt/2013/12/17/bbva-compass-first-new-u-s-core-bank-system-in-a-decade/#274a459c2a98
3. Silence Dogood April 26, 2018 at 1:49 pm
  
  In theory. But that is about as far as it goes.
  Migrating decades old functionality is hard enough, but no documentation (as has been said) is a killer. There will be functional units (of code) that do some critical operation, but no one will know how (or even why).
  Also, COBOL migration? If BCD (binary coded decimal) is used, god bless those rounding errors when data is now floating point.
  AND, without domain experts (what used to be called Business Analysts) to assist coders in understanding WHAT is required, don’t expect a CS graduate to be good at creating a functional system/sub system that does what is needed accurately. No knock on CS; every time I have walked into a new business/project, it has taken me at least 6 months to understand nomenclature and processes.
  Lastly, comments posted indicate responsibility probably lies both with management and IT.
  I concur. I once worked on a project where some of the developers stated “That’s not going to happen” when a potential failure point was noted. Needless to say, 9-12 months after the system was in place, it happened.
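
The rounding point in this comment is easy to demonstrate. COBOL’s decimal types store amounts exactly to a fixed number of decimal places, whereas binary floating point cannot represent most decimal fractions. Here is a minimal Python sketch that uses the standard `decimal` module as a stand-in for fixed-point decimal storage; the amounts are made up:

```python
from decimal import Decimal

# Ten payments of 0.10 each, summed two ways.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10)

# Binary floating point drifts away from the exact total.
float_is_exact = float_total == 1.0

# Decimal arithmetic keeps the total exact.
decimal_is_exact = decimal_total == Decimal("1.00")
```

Here `float_is_exact` comes out false and `decimal_is_exact` true. Across millions of ledger entries, discrepancies of this kind accumulate into balances that no longer reconcile.
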
  1. Oregoncharles April 26, 2018 at 2:21 pm
    
    Murphy’s Law.
    Closely related to the Law of Unintended Consequences: there always are some.
  2. David May 15, 2018 at 5:48 am
    
    I once worked on a system where we quickly showed that, if all the required operations were run one behind another, the comms. link could not run the traffic in the time allowed. We could do it, if we were allowed to overlap operations, but the auditors refused: each step had to be locked down, committed to the audit file, etc., before the next could start. Shortly after, I left…
    A few months later, I got a call “How would you like a job in the UK?” Turned out, the client was the customer for that previous job, now out trawling for some poor mug to make it work to that impossible spec.
Watt4Bob April 26, 2018 at 8:12 am

Legacy bank systems have their roots in a different time, and the culture that produced them has since been supplanted by one that is not only less competent, but criminogenic in nature.

The whole process has most likely been doomed from the start by an inability to understand that the neo-liberal religion practiced by our financial system leaves them intellectually handicapped.

When faced with the necessity for prudence, honesty and hard work, they are at a loss, as they have been trained to accept imprudent cost cutting, deceit, and shirking responsibility as ‘natural’ human nature.

Throw ‘buddy-deals’ into the mix, hiring your friends and connections to do work they are incapable of delivering, and ignoring the danger, because IBGYBG.

The neoliberal consensus is incapable of admitting that its inherent corruption is a disaster no longer waiting to happen.
visitor April 26, 2018 at 8:28 am

A frequent recommendation regarding the “too big to fail” banks has been to slice them into smaller entities that would not represent a systemic risk just because of their size.

Do the failures of Lloyds/Sabadell re: TSB and RBS re: Williams & Glyn to transfer a subsidiary to another organization because of the IT morass imply that such an approach is technically next to impossible for large banks? Are we just condemned to live with TBTF behemoths just because we cannot get a grip on their jungle-like IT?
1. vlade April 26, 2018 at 8:42 am
  
  Related, but not the same. It is hard to slice banks where you need to duplicate systems – although slicing by the business (i.e. separating the investment banking part from retail) can often be done relatively painlessly (the systems tend to use the same payments gateways, but otherwise they are quite separate).
  
  That said, there are always ways to do this – for example, by creating a new bank with flash new systems and all, and then moving people there of their own will (maybe encouraged by some incentives). The problem is that if you want to do that, you’re immediately creating two competing fiefs, and that has implications of its own.
TheMog April 26, 2018 at 11:13 am

Writing as a somewhat jaded software development veteran of 30+ years, about half of which was in investment banking so far, there are a few things surfacing in this mess that just make my head spin.

First, COBOL on .NET on Unix in the cloud has a few buzzwords too many in it already. COBOL on .NET I may be able to stomach, but running it on Linux (which is most likely what is meant by the “on Unix” part, let’s just give them the benefit of the doubt that they didn’t go with a proprietary Unix, which is unlikely these days) only makes sense as a cost cutting exercise. .NET’s native environment is Windows, and while there are solutions to building and running the software on Linux (.NET Core or Mono), I would not be surprised if there was an impedance mismatch between developers writing and debugging the code on Windows, then “throwing it over the wall” to those Linux geeks with the instruction of “just make it work”. This doesn’t mean it can’t work, but usually requires trading additional money and time for development and testing at the front end for lower cost at the back end.

Second, the infamous BeanCreationException is usually a Java exception, not a .NET one, and suggests that there are other bits bolted onto the system that are not COBOL on .NET. That in itself is not that much of an issue _if_ the integration has been properly tested. I think by now we know that that hasn’t happened. Plus, enterprise Java code these days often means “written by the cheapest outsourcing company our bean counters could find and engaged over the objections of the IT department”. Unfortunately in my experience that often means hiring outsourcing companies in countries that are culturally attuned to never rock the boat and who will implement exactly what you asked for, no matter how nonsensical it is. Works fine if your specifications are perfect. They won’t be.

Third, and I think that is the real failure here – the first rule of data and systems migrations is that you never, ever do a big bang migration, especially not without a fallback plan. The correct and more expensive way to do the migration is to migrate your users piecemeal while running both systems in parallel with (ideally constant, but probably daily) reconciliation between the two systems. That allows you to spot data errors quickly and worst case, switch the affected users back to the old system while you figure out what caused the data inconsistencies and fix the code and data before everybody is exposed to it. We don’t know if the parallel run happened here in the first place and even if it did we know now that the reconciliation isn’t up to snuff.

Oh, and it’s called a big bang migration because that usually accurately describes the crater left behind when (not if) something goes wrong. This is not a risk that conscientious IT professionals should take with other people’s livelihoods and money.
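
The parallel-run reconciliation described above can be sketched roughly as follows. This is a minimal illustration, not how any bank actually does it: it assumes each system can export a per-account balance snapshot, and the account ids, balances and function names are all hypothetical.

```python
from decimal import Decimal


def reconcile(old: dict[str, Decimal], new: dict[str, Decimal]) -> dict[str, list[str]]:
    """Compare per-account balances from both systems and classify the differences."""
    report: dict[str, list[str]] = {
        "missing_in_new": [],
        "unexpected_in_new": [],
        "balance_mismatch": [],
    }
    for account, balance in old.items():
        if account not in new:
            report["missing_in_new"].append(account)
        elif new[account] != balance:
            report["balance_mismatch"].append(account)
    # Accounts that exist only in the new system are suspicious too.
    report["unexpected_in_new"] = [a for a in new if a not in old]
    return report


def accounts_to_roll_back(report: dict[str, list[str]]) -> list[str]:
    """Accounts that should be switched back to the old system while the cause is investigated."""
    return sorted(set(report["missing_in_new"]) | set(report["balance_mismatch"]))


# Hypothetical end-of-day snapshots from the two systems.
old_ledger = {"A1": Decimal("100.00"), "A2": Decimal("250.10"), "A3": Decimal("0.00")}
new_ledger = {"A1": Decimal("100.00"), "A2": Decimal("250.09"), "A4": Decimal("5.00")}

report = reconcile(old_ledger, new_ledger)
rollback = accounts_to_roll_back(report)
print(report)
print(rollback)
```

A real reconciliation would compare transactions as well as balances, but even this crude check only works if the old system is still running and authoritative, which is exactly what a big bang cutover gives up.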

As an aside – I haven’t lived in the UK for almost a decade – are banks in the UK pushing as hard for their customers to switch to “electronic statements” as they do over here in the US? Because that would add another fun dimension for those people who then didn’t carefully update their financial data locally on a regular basis.
1. oliverks April 26, 2018 at 12:36 pm
  
  It could be they are using a tool like this?
  
  http://www.gtsoftware.com/products/netcobol/netcobol-for-net/
  
  It seems to allow you to mash up an unholy alliance of COBOL, OO COBOL, C#, .NET, and Cloud computing in one easy to swallow pill. Get your recommended daily allowance of buzzwords with one simple purchase.
2. ChrisPacific April 26, 2018 at 7:12 pm
  
  Well summarized. I would add that properly testing your integration ranges from somewhat challenging to incredibly difficult, depending on the toolsets concerned, developer relationships, management understanding and a whole host of other factors.
  
  I’ve seen the COBOL interpreted/compiled by XXX pattern (I’ll charitably call it that rather than an anti-pattern) before. It generally arises after someone calculates the cost of replacing the COBOL code entirely and finds it to be truly eye-watering and orders of magnitude over what was budgeted. So they come up with a scheme for either interpreting it at runtime, converting it to a different language in automated fashion or something similar. This allows them to avoid the big up front cost, at the expense of massively complicating their architecture and saddling themselves with an ongoing technical debt overhead (and unless they plan to run COBOL forever, they haven’t eliminated the big cost, only delayed it and made it even bigger). But big companies are notorious for underestimating TCO on decisions like this, so it looks like a win to them.
  
  This is not that scenario – they’re actually moving to a ‘new’ product that has this architecture that you normally only see from badly-planned legacy migrations. The most logical explanation I can think of is that the ‘new’ platform they have been sold by Accenture is in fact itself a legacy system that has been poorly or incompletely modernized. Like you I don’t even know where to begin with the .NET on Unix aspect, or the fact that we are somehow seeing Java exceptions from an architecture that isn’t supposed to have any Java.
  
  Your comment about the meaning of enterprise Java explains a bit about my conversation with lambert and fajensen a couple of days back. I feel fortunate that I am able to work with colleagues and clients that (mostly) know what they are doing, and aren’t outsourcing.
Paul Harvey 0swald April 26, 2018 at 12:54 pm

I see signs that this is pandemic. Slightly off topic – and certainly anecdotal – but I have received notices in the last 10 months that 1) my vehicle insurance will be cancelled in a matter of days and 2) my health insurance will be cancelled without a premium payment immediately. I had set up automatic payments in both cases, and they had been functional for years. The health insurance was deducted from my bank account, i.e. paid, but the vehicle insurance was not. I wasn’t aware of either problem until I received respective nasty-grams in the mail – past the cut off dates, and well past the due dates. Code is indeed law.
1. Silence Dogood April 26, 2018 at 1:55 pm
  
  I have noted similar happenings.
  Invariably, whenever an application/service I have been using stops, I immediately posit: “There has been an upgrade and they broke all the existing settings/configurations I had.” And sometimes the time and effort to recreate is not insignificant. It makes me wonder what the thought process was about rollout for an existing system. Obviously, little or none…
skippy April 26, 2018 at 4:15 pm

Upon reading, my mind keeps referencing episodes of Black Adder Goes Forth for some reason.

Anywho seems similar to watching construction over the last few decades, progress[!!!!], youthful new energies thingy, people that actually do a thing are replaced by the people that sell a thing in system admin, etc.
AEL April 26, 2018 at 6:34 pm

I agree that revealing internal error messages is not normal practice.
However, under extreme duress almost all applications will do exactly that.

I have done load and stress testing of many applications for a large organization.

Take for example an out of memory problem.
When this happens, the entire state of the application is now unknown.
Alas, correctly sanitizing the error message takes more memory (which you may or may not have; heaps are wonderful things).

This can produce another error.
(except of course this error happened in error processing, so different rules apply).
The additional error also requires memory which can produce another error (and since we are doing error processing within error processing inside error processing, different rules again apply).

If you add in other constraints (like too many TCP/IP connections, database connections, intra process locks, etc., all caused by extreme duress) you can get all sorts of emergent behavior as different errors interact.

Normal unit testing will *not* predict how the system will behave under extreme loads and it is very difficult (i.e. expensive) to write an application to behave sanely under multiple types of resource exhaustion. And of course, one application “behaving sanely” by, perhaps, simply shedding load and refusing connections it cannot support can simply shunt problems to some other application which then shatters.
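
The “shedding load and refusing connections” behavior mentioned above might look something like this minimal sketch. The class name, the limit and the status codes are assumptions for illustration; it refuses work beyond a fixed number of in-flight requests instead of queueing it.

```python
import threading


class LoadShedder:
    """Refuse new work once a fixed number of requests are already in flight."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: fail fast rather than queue up behind a busy system.
        if not self._slots.acquire(blocking=False):
            return (503, "Service busy, please retry later")
        try:
            return (200, work())
        finally:
            self._slots.release()


shedder = LoadShedder(max_in_flight=1)


def outer():
    # While this request holds the only slot, a second request is shed.
    return shedder.handle(lambda: "inner")


first = shedder.handle(outer)
second = shedder.handle(lambda: "ok")  # the slot is free again by now
print(first, second)
```

Note that the refusal itself is what gets shunted elsewhere: whatever calls this service now has to cope with the 503s, which is the knock-on effect described above.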
1. vlade April 27, 2018 at 3:50 am
  
  Sorry – internal debug messages should never, ever make it into the production browser stuff – especially of this type. That’s what log files are for.
  
  Robust error handling is a base of good design and programming. Unfortunately, most people find it boring, and concentrate on delivering features – which is also what their users and managers want until things blow up.
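  As a rough sketch of the “that’s what log files are for” rule: details go to the log, and the browser gets only a generic message with a reference. The logger name, message text and `safe_call` wrapper are hypothetical, and this covers only failures the application code can still catch.

```python
import logging
import uuid

logger = logging.getLogger("bank.web")  # hypothetical logger name

# Built once at start-up so the last-resort path needs no formatting at all.
GENERIC_ERROR = "Sorry, something went wrong. Please try again later."


def safe_call(handler, *args):
    """Run a request handler; never let internal details reach the caller."""
    try:
        return 200, handler(*args)
    except Exception:
        try:
            ref = uuid.uuid4().hex[:8]
            # Full traceback goes to the log, keyed by a reference the user can quote.
            logger.exception("request failed, ref=%s", ref)
            return 500, f"{GENERIC_ERROR} (ref {ref})"
        except Exception:
            # The error handling itself failed: fall back to the pre-built constant.
            return 500, GENERIC_ERROR


def broken_handler():
    raise RuntimeError("BeanCreationException: internal detail")


status, body = safe_call(broken_handler)
print(status, body)
```
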
  
  TBH, this is even worse with most so-called architects, who believe that their main role is to draw nice pictures with boxes that have arrows coming in and out.
  1. AEL April 27, 2018 at 9:30 am
    
    Vlade:
    I understand what you are saying and agree that internal error messages should not be displayed.
    However, my experience is that under extreme load, it is virtually impossible to always do that.
    
    Design and programming practices are only partially helpful here.
    
    At a certain level of duress the system will lose its sanity. Thus the application will no longer do what you programmed it to do, but will follow wherever its corrupted internal state takes it. At that point you typically wind up with messages from application and system layers *below* the level of your code. I.e. you are no longer in charge of what comes out.
MichaelSF April 27, 2018 at 1:15 am

We’re turing again

That seems an appropriate typo for this subject.
