Software that is so old it should be on a beach in Marbella!
https://twitter.com/drrosena/status/1313066062284500993?s=21
Oracle Life Science Database engine with a SAS analytics engine pulling data via SQL. R is for amateurs. You get what you pay for.
I don't think the government IT budget would have allowed for that. **** Oracle and their ridiculous licensing terms! We've just deployed a 19 host VCF cluster, we had to deploy a separate 3 host vSAN cluster for Oracle DBs alongside it as even though the Oracle DBs would have only run in one of the VCF workload domains we would have still needed to licence all 22 hosts, which would have added £15m to the project cost (3 times the combined cost of the rest of the software and the hardware).
I think you misunderstand the “IT budget” that is at stake here. Compared to the opportunity cost of propping up U.K. PLC, the IT cost (and vaccine) pales into insignificance.
Have to take the longer view. And do it properly.
Looking at past datasets, there was a marked improvement in quality when they brought in the external experts. Little things like variable names were improved.
You guys are funny.
Hope you have got your tenders in for the new customs system that needs to be built. They would have snapped your hand off, "three months, easy".
P.S. Please look at this PHE website and tell me that this is all built on Excel:
https://coronavirus.data.gov.uk/
P.P.S. The IP address for that site resolves to Microsoft in Amsterdam. Shock horror Azure.
INRAT but I suspect it is pie in the sky software delivery stuff. As someone who has worked in IT for decades managing the delivery of software projects from expert teams of developers. I think I would stick with Excel.
On town planning, I would suspect its a way for local authorities to hold more power.
It's because it's been delegated to Local planning authorities since 1947 and nobody has put together the technical framework or law amendments to co-ordinate the data management.
Our lab system runs an SQL which dumps a load of data into Excel. I “clean” the data, convert to CSV then upload to the PHE portal.. I know nothing about computers!
The system in the lab isn't providing a fully processed data output. Therefore, it is being manually processed by somebody with a low level of confidence in their own competence and there doesn't appear to be any reviews/checking. Inefficient use of resource and prone to introducing error. I'd hope that this is just reporting data and such a process wouldn't be tolerated if people's health could be impacted if a mistake is made.
Confession: I sometimes open and read IT threads on here that I have no hope of understanding because I find it weirdly reassuring. I didn't intend for this to be one of those threads but it certainly is.
Also, am I right in thinking that this specific problem would not have happened if they were using a newer version of Excel?
Having worked on an NHS project at the start of Covid the one thing that's not being said is the pressure from on-high to deliver results in very short timescales
All the talk of setting up this in the cloud, configuring that and bingo perfect running system is great but at the end of the day some people have been set the task of delivering ABC by the next day after they'd already delivered XYZ 24 hours earlier
Sometimes chucking it into Excel and getting a person to crunch the data is just the way it gets done to meet the deadline - Not that i think it's right 🙂
Try not to think like that @stevious, try to think more about a large government contractor actually using technology that is fit for the purpose it is going to be used for. Even if Excel had the ability to use an infinite number of rows, it is still the wrong choice for this purpose.
I mean, I once ‘accidentally’ took out a large part of of the mobile network of a European capital city for an hour because I thought I was typing in another window. How we laughed (later, much later)
I did that once in China in the early hours of the morning. Typed 'shutdown -h now'* in a shell window thinking it was my local console - it wasn't. Cue a very frantic night of finding the guy with the key to the building where I'd just shut down the cell site, going to his house, waking him up at 2am, shoving him in a taxi and driving to 'work', wandering round a huge skyscraper in the dark trying to locate a comms room on the Nth floor. Finally locating the kit and restarting it. All through my assistant as I didn't speak a work of Chinese and no one, bar her, spoke a word of English.
* This was in the days when if you put RedHat on a laptop it didn't have any integration with power buttons etc and the only way to shut it down was by command line.
@willard, I get what you're saying and I don't doubt that the process used would have thrown up a different issue. But if they were using a newer version of Excel would we have seen this specific problem?
@willard, I get what you’re saying and I don’t doubt that the process used would have thrown up a different issue. But if they were using a newer version of Excel would we have seen this specific problem?
It may have avoided the problem they had, but whats to say something else didn't fall over in the process?
IMO, it doesn't matter what program you use. Having a robust process is key, bringing the data in, check it, process it, check it, make it into a pretty picture for people to digest, check it, release it.
Excel is the defacto tool, everyone has it on their pc and with a bit of training, can navigate round it.
I used to do a lot of Matlab scripting, was once asked if it could be done in excel, politely told said manager, that excel was akin to trying to knock a building down with a pencil. Once it was all processed, checked, calculated, checked, it waas then fired at excel, as that was found to be the easiest way to manipulate maps/calibrations without faffing with licences for other stuff.
Sounds like they were going from csv -> xls -> postgres. Nobody in their right mind would design this up front, not only is the xls bit a pointless step, the use of xls files is a mare for licensing, platform, and library support. There are also a gazillion things that will quite happily do csv -> postgres for you in a reliable/validated/audited way.
Assuming it is not total incompetence (assuming) then this stuff is the kind of stuff that only comes about because it evolves over time. That I can completely understand tbh. Still I expect they are getting some investment here sharpish.
Haven't read the whole thread, but has anyone suggested adapting Pokemon Go?
In the meantime some consultant with excel, python and possibly R will have knocked up a quick and dirty solution in a couple of days which answers the questions being asked by people in boardrooms who don’t care about IT. Guess which one they went with?
Pretty much sums up my career path, I create scripts which solve problems deemed too complex / expensive to fix and then my scripts become the official 'solution' and they take on a life of their own. 120k lines of VBA and 50k lines of bash later I have a full time job maintaning them 😉
I basically specialise in inappropriate software solutions...
You guys are funny.
Hope you have got your tenders in for the new customs system that needs to be built. They would have snapped your hand off, “three months, easy”.
Depending on the robustness, I suspect a specialist company would find it this easy. Its probably the paperwork and bureaucracy that would take the longest. In one of the collaborative projects I worked in, the data analytics took longer than they should, not because of the system itself-this had been ready for a while, but because the content/variables hadn't been designated clearly. I mean from the info of the website every row data set is what:
"name" "region1" "region2" "region 3" "test result" "date tested" "date test result" "hospital detail 1" "hospital detail 2" "hospital detail 3"
I'm no expert in data science, but the method they used is just one step above from printing all the sheets and putting them into a binder to then count manually. This is why people are so frustrated, there is whole spectrum of options between what they did and a fully functional deluxe industry standard cloud database(as the ones described above). The data is not what are joe doe's favourite colours, the data is concerning a global pandemic.
Next thing we know the excel spreadsheet is not backed up into the network and opened from a USB drive.
P.S. Please look at this PHE website and tell me that this is all built on Excel:
You could use excel to classify the data in the parameters given. Its easy to write a script in whatever programme/language to import data from excel. As I said earlier, excel is good at being a jack of all trades, however for this case it should not have been used.
I'm getting far too wound up about this mess issue....
MongoDB because the data is “big data” (complex and not very relational).
Hmm yeah.. with map/reduce jobs for the tracking and tracing part. I don't think that would take very long to implement tbh. I wonder if I could knock something up as a demo..?
Once the minimum viable product was up and running, then you could do other fancier stuff with the data since it's in a generic form.
The very, very basic question is how do you get the data out of the various lab systems and into your fancy number cruncher?
The very, very basic question is how do you get the data out of the various lab systems and into your fancy number cruncher?
Give me a few hours and I'll have a prototype in Excel up and running.....
The very, very basic question is how do you get the data out of the various lab systems and into your fancy number cruncher?
A temp on minimum wage cut and pasting into excel?
You can use a lot of different ways depending on the formats/ways one will export data and the other will take it in.
JSON, AJAX, CSV, XML... You can go for a totally custom format and put it over a network using a REST API if you want, but it relies on people having a clear idea of what they want to do and how they want to do it; all have potential issues and advantages.
The example @footflaps gives is pretty much spot on for a lot of software that evolves over time from "handy script for chopping small bits of data" to a major part of back office. People _know_ that it needs a proper solution, but the cost of making the change, of ripping it all ot and replacing it with something that is designed for the expanded requirements, is far bigger than keeping one person in support hell. Hell, I've seen quotes of 18 months elapsed for a team of people to replace one shitty broker. A TEAM. Not just 18 months of dev effort, A TEAM.
JSON, AJAX, CSV, XML… You can go for a totally custom format and put it over a network using a REST API
Lovely. Very modern.
Now what if the lab's system runs on software that is 10 years old and was never designed to make REST calls to the internet? In fact, the design brief at that time was that it should be isolated from the internet due to the nature of the data.
See where I am going with this?
The very, very basic question is how do you get the data out of the various lab systems and into your fancy number cruncher?
I use lab equipment for my PhD. My results go from the inevitable Excel output file straight into a suite of Matlab scripts that I wrote. I've also built a Matlab database, which then exports into R for analysis*. I'm someone with no computer training, this just seemed like a way to save time** and avoid errors in my PhD. I naively thought that people doing more important things might have robust data systems in place, especially given the large proportion of the population who seem to work 'in IT'.
* Via csv file, admittedly.
** Classic coding 'timesaving' by spending weeks writing scripts to save a few minutes per experiment. I'll 'break even' in about 10 years I reckon.
But if they were using a newer version of Excel would we have seen this specific problem?
As I understand it it was more the Excel file format they were using (XLS rather than XLSX), even with a current version of Excel you can save as legacy XLS if you want. Generally you only do that for specific backwards compatibility requirements, that said the government agency I work in is using Excel 2010 but has the default file format set as XLS (don't ask). I struggle to believe PHE are using Excel 2003 (the last version before XLSX) but maybe they are...
Lovely. Very modern.
Now what if the lab’s system runs on software that is 10 years old and was never designed to make REST calls to the internet? In fact, the design brief at that time was that it should be isolated from the internet due to the nature of the data.
See where I am going with this?
That is what ETL is for. Regardless if the data source is legacy onsite system, CSV files or nice JSON via API. It will get it into your data warehouse / lake.
Everyone has to get out of the mindset of what they use, what is feasible in the lab or their office. Look at what Facebook, Tik Tok or Snap do. Whether you like them or not they deal with much more complex challenges, their management of data is pretty public and easy to replicate. Pretty sure if the Gov called an architect or data engineer at any of them saying they needed a little help they would have sketched out what to do on a napkin over a cup of coffee for free to help save the country and all.
What's doing the E for Extract bit? Devil is in the details.
Unless it's E for Excel, which is what they are doing.
ETL = excel, truncate and load 🙂
Legacy lab software designed to handle large data sets is fine even with no API.
The point here is that Excel should not be anywhere in the chain of a system meant to track the population of a country.
There are so many well known cases of large scale blunders caused by Excel. Here are a few for entertainment: https://www.bbc.com/worklife/article/20130903-five-awkward-spreadsheet-snafus
I’m bailing on this thread. Hope someone at the gov may find it useful. I need focus on actual cycling when I visit this place 😂😂
It's not even Excel at fault, whatever solution they use, no matter how (in)appropriate should have had a test process around it which spotted that half the data points vanished between steps x and x+1...
There never was any problem with any of the hardware or software.
The problem was totally and completely in the wetware.
Always has been, always will be.
ETL = excel, truncate and load
😀