Data and DevOps Digest, episode 4
Welcome to Data and DevOps Digest, brought to you by cloud consultancy Vivanti. A news and analysis podcast, we cover the trends, thought leadership and announcements happening in today’s Data, DataOps and DevOps space.
Episode four looks at five areas that went through the industry news cycle over the past fortnight:
- Regenerative models and synthetic data pipelines
- Comparing Deep Learning and Machine Learning
- DevOps and Cloud InfoQ Trends Report
- Will AI rule the world?
- Data science vs decision science
Introducing: Data & DevOps Digest!
Hello, and welcome to Data and DevOps Digest – a news and analysis podcast about the latest data-driven innovations; the trends and technology impacting the people who develop them.
This is all brought to you by cloud consultancy, Vivanti. I’m your host, Lachlan James. I’m also joined by Vivanti Principal Consultant, James Hunt. James: It’s a pleasure to have your company again. I know it’s a Monday, so if you’re like me the cogs aren’t quite turning yet, BUT – do you notice anything different?
I’m on the wrong podcast! What’s all this about data?
[Laughs], for those of you playing at home, you might’ve noticed we’ve cast the net a little wider, updating our name to ‘Data and DevOps Digest’.
This gives us scope to cover some of the enthralling developments coming out of the data science and AI realms, whilst also running afoul of these guys – devopsdigest.com – a news site also dedicated to the happenings within the global DevOps community.
James: It’s almost as if we should have done some research before diving in…
Apple Computer was named after Apple Records, and they did ok. But since neither of us have the charisma of Jobs or the genius of Woz, a name change is definitely the correct route.
Regenerative models and synthetic data pipelines
This was a topic recently discussed on astronomy.com, where these models’ outputs are known in the astronomy field as ‘Galactic Forgeries’. We’ll also come back to astronomy.com later in today’s episode.
Whilst most ML models are built to help programmatically distinguish between different types of data, the ML models creating so-called ‘Galactic Forgeries’ are designed to produce new data.
These generative models are a subset of AI techniques that create artificial data products, such as images, based on some underlying understanding of the data used to train them. They can be used to feed more traditional ML models with fresh training data, so that those models can continue to improve accuracy and efficiency.
Gizmodo.com even dissects how similar capabilities developed by OpenAI gained notoriety by automatically inserting Garfield into episodes of Seinfeld. Trust me; it’ll make way more sense when you check out the article.
So, if I pull myself back from the black hole of astronomy, TechRepublic.com also published a piece called Synthetic Data: The Future of Machine Learning. I thought this was a really instructive article on the topic, so I encourage our viewers to go and check it out. But, instead of me just pulling apart the highlights on my lonesome, I wanted to pull you in at this stage James.
Talk us through how synthetic data is manufactured, its particular usefulness for training AI, and any other color you’d like to add that’s maybe not covered by TechRepublic.
To make fake (or rather, synthetic) data, you start with a random number generator and some constraints. That’s really it. The random number generator gives you a source of pure, distributed randomness, and then you use the constraints to shape the data.
Let’s say we were trying to train a model that relied on yearly household income information, yet we were unable to get our hands on the underlying fine-grained data for that. What we do know, however, is that income follows some rules – nobody makes a negative salary, precision is rarely needed past the thousands mark, etc. Income data also follows a particular distribution. More people make less than $100,000 per year than make more than $100M per year.
If our model isn’t trying to figure out these boundaries and distributions itself, we can use them to build realistic-looking data according to those constraints, and that data is just as useful as the real thing.
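James’s recipe – a random number generator shaped by constraints – can be sketched in a few lines of Python. This is a minimal illustration with made-up distribution parameters, not a production generator: incomes are drawn from a right-skewed lognormal, then clamped and rounded to honor the “common sense” rules he describes.

```python
import random

def synthetic_incomes(n, seed=42):
    """Generate n synthetic yearly household incomes.

    Constraints (our 'common sense' rules):
      - never negative
      - rounded to the nearest $1,000 (finer precision is rarely needed)
      - right-skewed: most households well under $100k, with a long tail above
    """
    rng = random.Random(seed)
    incomes = []
    for _ in range(n):
        # lognormvariate shapes pure randomness into a right-skewed
        # distribution, roughly matching how real incomes spread out
        raw = rng.lognormvariate(10.8, 0.7)        # median around $49k (illustrative)
        income = max(0, round(raw / 1000) * 1000)  # apply the constraints
        incomes.append(income)
    return incomes

data = synthetic_incomes(10_000)
print(min(data), max(data))
```

The constraints do the real work here: swap in whatever distribution and rules your domain theory supports, and the generator stays the same.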
The prevalence and production of synthetic data really is growing. You only have to look to recent investments by a range of heavyweights, including the likes of NVIDIA, who’ve poured R&D efforts into its Omniverse™ Replicator, which enables data engineers to build custom synthetic data generation pipelines.
And so thinking about this trend, I’ve got a two-parter for you James: Firstly, as you see it, what are the main forces driving demand for the production of synthetic data for AI and ML model training?
A dearth of data. Sometimes it’s an outright lack of a particular data set, yet we have theories or “common sense” rules to aid in the manufacture of a synthetic dataset. Sometimes it’s sheer inaccessibility, whether due to cost, legal obstacles, or privacy concerns. It could also be that the data exists, but is too messy for our purposes.
Additionally, synthetic data will always lack the “rough edges” of real data, so to speak, which helps constrain the training model and keep it from relying on specificities of the data that we deem to be irrelevant. Wouldn’t it be weird if a machine learning model started predicting the likelihood of a promotion based on whether one’s yearly salary was an odd or even figure?
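That odd-versus-even salary quirk is easy to demonstrate. In this hypothetical sketch, real-world salaries carry incidental dollar-level detail a model could latch onto, while synthetic salaries rounded to the nearest thousand make that “feature” constant – there’s nothing left to overfit on.

```python
import random

rng = random.Random(0)

# Pretend these are real salaries, with incidental dollar-level detail...
real = [rng.randint(30_000, 150_000) for _ in range(1_000)]
# ...so roughly half happen to be odd numbers: a meaningless signal
odd_real = sum(s % 2 for s in real)

# Synthetic salaries, rounded to the nearest $1,000 by construction,
# are all even -- the spurious parity feature carries zero information
synthetic = [round(s / 1000) * 1000 for s in real]
odd_synth = sum(s % 2 for s in synthetic)

print(odd_real, odd_synth)
```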
Now, all this begs another question, James: Can synthetic data actually replace real data when it comes to more efficiently developing new ML models and AI algorithms?
This got me thinking about an article I saw late last year from Neil Raden, a mathematician and industry analyst covering analytics and AI. He was on my analyst briefing ‘hit list’ when I was working at analytics vendor Yellowfin – a lovely guy. In October ‘21, he wrote a piece for diginomica.com titled Synthetic data for AI modeling? I’m still not convinced. In it, he rebuts Marcello Benedetti’s core argument in the article The Advantages and Limitations of Synthetic Data: that organizations can make mounds of useful data synthetically, simply by replicating real-world data sets and stripping out personally identifiable information.
Raden argues that anonymizing data, in almost every use case he can think of, has a limiting effect. He concludes, and I quote, “In every document I see, this effect of anonymizing synthetic data seems to imply it is an important feature; all I see are drawbacks”. He also shoots down the use of GANs, or Generative Adversarial Networks – which pit two neural networks against each other to generate synthetic data that resembles real data. He says he doesn’t see the point in creating synthetic data similar to what you already have, given its propensity to dilute (or over-amplify) your collected data.
He also refutes a summary of benefits offered by synthetic data, stating that synthetic data doesn’t overcome data usage restrictions if all the essential properties of the real data have to be included for it to remain useful. He adds that there is no valid argument he’s read to support the idea that reliable synthetic data can be created to accurately simulate not-yet-encountered conditions.
He then lists some commonly–accepted limitations of synthetic data creation; among them, that:
- It’s hard to account for outliers
- Output control is difficult – it’s hard to ensure synthetic data replicates the complexities within original datasets
- Synthetic data reflects, and can actually amplify, biases in source data
James: What’s your response to all this, and Raden’s assertion that it’s an immature sub-industry with a long way to go? What are the pros and cons of synthetic data creation as you see it?
I think synthetic data can be useful, but in a limited subset of situations. For starters, you absolutely cannot use synthesized data sets to discover anything new about the data. At best, synthetic data is an augmentation to an actual data set. As with any new technology, methodology, or other ology, there are camps on both sides who are exaggerating the benefits and the downsides.
The supporters tend to go overboard in applying synthesis; everything can be made better with more data, and since synthetic data is (by definition) unlimited, it’s always better, right?
Detractors lean the other way. Since synthetic data has a narrow use, it might as well have no use, since real data, if properly cleaned and normalized, does tend to outperform on all fronts.
The software industry had this same argument years ago with the advent of test-driven development. The idea was to write tests, in code, to exercise the other code you were writing. Supporters saw it as an end to the era of user-driven testing. No more would we have to build and ship product to see if it would work! Detractors saw it as “just more code” that would itself have bugs and besides, since you were going to have to let the users actually use the product anyway, why bother with the synthetic part to begin with?
I think synthetic data is most useful when you want to incorporate data that is not directly related to the why of your machine learning exercise, and if (and only if!) you are prepared to accept the limitations of pre-disposing your model in the direction of your synthetic data sets’ constraints.
Comparing Deep Learning and Machine Learning
Staying with machine learning, because there was an absolute trove of interesting articles on the topic over the last two weeks, CioInsight.com published a really useful piece comparing the uses for – and definitions of – Deep Learning and Machine Learning. Personally, I loved this because you see these two terms used far too interchangeably in the data world. I think it does a good job of setting the record straight.
Whilst I encourage everyone to check out the full article, author Aminu Abdullahi essentially draws the line in the sand like this:
- He says that ML algorithms build models based on structured data. These models can then be used to make predictions about new data. The model’s predictions will become more accurate as it encounters more data.
- By contrast, Abdullahi defines Deep Learning as a subset of Machine Learning, which uses an artificial neural network to learn from large data sets, attempting to make connections between various inputs and outputs.
- He concludes that Machine Learning is best for structured data that can be easily labeled, such as tabular data; while Deep Learning is best for data that is unstructured or has a complex structure, such as images or text.
James: Any colour you’d like to add to Abdullahi’s definitions and utility?
To me, the difference comes down to how well a human can understand the final model. Curve fitting is an example of machine learning. With a scatterplot of points – representing two categories of something – on a Cartesian plane, we can use something like nearest neighbor to find the formula that divides the plane into areas that represent this bifurcation. The output – this formula – is mostly understandable to people.
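As a quick illustration of that kind of interpretable model, here’s a toy nearest-neighbor classifier in plain Python. The points and labels are entirely made up, but the mechanics are the real technique: a new point takes the category of whichever labeled point it sits closest to.

```python
import math

# Two made-up categories of points on a Cartesian plane
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((7.0, 7.5), "B"), ((8.0, 7.0), "B"), ((7.5, 8.5), "B"),
]

def nearest_neighbor(point, data):
    """Classify `point` by the label of its closest labeled point."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    closest, label = min(data, key=lambda item: dist(point, item[0]))
    return label

print(nearest_neighbor((2.0, 2.0), training))  # lands in the "A" cluster
print(nearest_neighbor((7.0, 7.0), training))  # lands in the "B" cluster
```

Crucially, you can explain every decision this model makes – “it was closest to this training point” – which is exactly the human-legibility that deep learning gives up.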
Deep learning, on the other hand, deals with artificial neural networks – bizarre little computer programs that attempt to mimic our understanding of the biology of the human brain. The machinery that these algorithms produce (via training) is mostly incomprehensible. We only believe it because we run an exhaustive set of data through it to convince ourselves that the model works.
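To show why that machinery resists inspection, here’s a minimal sketch of a single forward pass through a tiny artificial neural network. The weights are arbitrary, illustrative numbers – in a real network they’d be learned from data – and the point is that the prediction emerges from layers of weighted sums and nonlinearities, where no individual weight means anything on its own.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    """One forward pass: inputs -> hidden layer -> single output score."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# Arbitrary illustrative weights -- inspecting them tells you almost
# nothing about *why* the network produces the score it does
w_hidden = [[0.5, -1.2], [1.1, 0.3], [-0.7, 0.8]]
w_out = [1.5, -0.4, 0.9]

score = forward([2.0, 1.0], w_hidden, w_out)
print(round(score, 3))
```

Scale this up to millions of weights across dozens of layers, and you get the “we only believe it because we tested it exhaustively” situation James describes.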
For me, the most useful component of this article is the table, which compares – side-by-side – the typical components, benefits, drawbacks and use cases for Machine Learning and Deep Learning.
I’m not going to sit here and go through each row line-by-line, but we’ll make sure we’ve got it up on the screen so that you can assess it for yourself. Instead, I’d like to pull you in here again James: Is there anything you’d like to amend or add to this comparative table, juxtaposing Machine Learning and Deep Learning?
Well I’m certainly not going to go through this line-by-line; our video editor already tries to cut out the jokes in the interest of time! Actually, I think it’s a pretty approachable and thorough list.
If I had to add one thing, I’d put a summary line at the bottom: Machine Learning is simpler and more accessible, but limited in scope of application. Deep Learning is less accessible and more complex, but has much broader scope.
DevOps and Cloud InfoQ Trends Report
Alright James, the latest newsy tidbit I wanted to discuss from the land of DevOps comes from InfoQ’s June 2022 DevOps and Cloud Trends Report. The report tracks and analyzes major technology trends in the “cloud computing and DevOps” space.
Today I’d like to focus on comparisons between the 2022 and 2021 versions of the report, and look at the major changes taking place, particularly as emerging technologies begin to make the leap across ‘The Chasm’, towards broader adoption and widespread deployment in production environments.
The first section of this comparison comprises some of the biggest changes from 2021 to 2022. Out of the 7 items in 2021’s ‘Innovators’ phase, 4 of them made the leap to the next ‘Early Adopters’ phase – Policy as Code, Hybrid Cloud Approaches, FinOps and Sustainability Accounting.
James: Talk us through these developments. Does anything stand out to you in particular?
The seemingly unstoppable rise of serverless as a way of doing business from day one is pretty fascinating to me. FinOps I think was always inevitable; money talks, after all, but both policy as code and hybrid cloud are tied to this unflagging push to make more things time-billed and ephemeral.
If you read into the remarks of Leon Hall, over in the AWS Developer Relations group, we can kind of understand why: the developer experience is taking center stage at more and more companies and in more and more projects. Ultimately, developers don’t want to manage infrastructure. They never have, and I doubt they ever will. They want to get to production, be able to reason about their overall solution, and be able to replicate it locally for development and debugging.
I think this next section is the most interesting to compare, as we look at the technology areas that InfoQ suggest have made that leap across Geoffrey Moore’s ‘Chasm’.
Of the 14 things cited in 2021’s ‘Early Adopters’ category, 5 purportedly leapt The Chasm into the ‘Early Majority’ phase of adoption – Chaos Engineering Practices, Serverless Databases, Edge Computing, Team Topologies, and Measuring Performance. James: Talk me through this transition. Does anything surprise you here?
Nothing surprises me, Lachlan! Except that Serverless Databases thing, which in retrospect makes the utmost sense.
Hall actually talks about serverless databases. The ability to morph the rest of your application architecture – the bits your software team didn’t write themselves – is going to be huge for rapid prototyping and rapid delivery. The surprising part about serverless databases, as a concept in this report, is that they didn’t even show up in the 2021 report. That’s what you call a “meteoric rise” in popularity.
The other thing I noticed is that ‘Transformational Leadership’ seemed to disappear, without any explanation. What do you think’s going on here James?
Candidly, any time the word “transformation” is used outside of the mechanical act of changing one thing into another, my BS detector starts going off. The so-called “three T’s”: Application Transformation, Digital Transformation, and Organizational Transformation always seemed to be more buzz than word. Perhaps the InfoQ editors realized that and quietly removed it from the 2022 report hoping no one would notice? They certainly didn’t address it in their article.
In a more serious sense, I think the idea of “leading transformation” was very popular when things like App Tx and Org Tx were hitting walls. It’s very common for people to reach for a figure of authority when things aren’t going well. As an industry, however, I don’t think we have the inertial momentum keeping us in place. Very few people doubt the efficacy of DevOps practices, and no one argues with the results of things like CI/CD, automated testing, observability, and usage-based infrastructure. I think we’re past the need for visionaries leading us into the fray of modern software delivery.
When it comes to evolving from ‘Early Majority’ to the ‘Late Majority’, contrasting the 2021 and 2022 reports highlighted one talking point – ‘Feature Flags & Blue/Green Deployments’ slid into the ‘Late Majority’ phase.
James: Talk to me about what this means. And do you think there are any technologies in the DevOps space transitioning from early- to late-stage majority adoption worth mentioning that don’t appear here?
Feature flags are an interesting governor on the speed and velocity of agile software delivery. With feature flags, you embed some sort of configuration on something you wouldn’t normally think you’d want to configure: the availability of a new feature. Let’s say you’re Netflix and you’ve got a brand new recommendation algorithm built on top of millions of hours of subscriber viewing. The old recommender is still serviceable, however, so you want to keep that intact.
With a feature flag, you can tell the live production system which recommender to use. When you deploy, this flag gets set to “old” which means nothing functionally changes for your users. At some point, you flip that flag to “new” and see what happens. You keep an eye on error rates, you watch re-engagement metrics out of your observability system. You gauge the impact of using the new recommender on your terms. You haven’t slowed down deployments – those keep happening. You maintain velocity.
At some point, you may find that more people tune out after the last episode of whatever they are binge-watching, and don’t click on any of the other titles that the new recommender recommends. This can be revenue impacting. Luckily, thanks to feature flags, you can keep the code intact and just flip the flag back to “old”. Voila! Instant rollback, no loss of delivery velocity. Win-win.
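The Netflix-style scenario James walks through can be sketched in a few lines of Python – all names and recommenders here are hypothetical. The flag is just a piece of runtime configuration that routes a call to the old or new implementation, which is why rollback becomes a config change rather than a redeploy.

```python
# A minimal feature-flag sketch -- in production the flag would live in a
# config service, not a module-level dict
FLAGS = {"recommender": "old"}

def recommend_old(user_id):
    return ["rerun-classics"]  # serviceable, battle-tested

def recommend_new(user_id):
    return ["fresh-ml-picks"]  # shiny, unproven

def recommend(user_id):
    """Route to whichever implementation the live flag selects."""
    if FLAGS["recommender"] == "new":
        return recommend_new(user_id)
    return recommend_old(user_id)

print(recommend(42))          # deploy with the flag set to "old": nothing changes
FLAGS["recommender"] = "new"  # flip the flag: the new recommender goes live
print(recommend(42))
FLAGS["recommender"] = "old"  # metrics look bad? instant rollback, no redeploy
print(recommend(42))
```

Notice that both code paths ship in every deployment; only the flag decides which one runs, so delivery velocity never slows.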
Blue/green is along the same lines; when you’ve got a new version of something, you push it out and let, say, 5% of the traffic route to the new version. As you gain confidence that the new change won’t break anything, you move more traffic from the old version to the new version. One of those is called “blue” and the other is “green”.
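That traffic split can also be sketched simply – this is a toy router, not a real load balancer: a configurable fraction of requests land on “green” (the new version) and the rest on “blue”, and ramping up confidence is just raising the weight.

```python
import random

def make_router(green_fraction, seed=1):
    """Return a router sending roughly green_fraction of traffic to 'green'."""
    rng = random.Random(seed)
    def route(request_id):
        # each request independently lands on the new version
        # with probability green_fraction
        return "green" if rng.random() < green_fraction else "blue"
    return route

# Start cautiously: ~5% of requests hit the new ("green") version
route = make_router(0.05)
hits = [route(i) for i in range(10_000)]
print(hits.count("green"))

# As confidence grows, raise the weight -- at 1.0, green takes all traffic
route_all = make_router(1.0)
print(route_all(0))
```

Real systems do this at the load balancer or service mesh, but the principle is identical: the rollout is a dial, not a switch.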
Overall, the transition of both of these to the end of the adoption curve is great news; it means that, as an industry, tech is embracing more elaborate, safer mechanisms for deploying change. That should ensure greater stability all around.
Will AI rule the world?
So James, apparently it was National Artificial Intelligence Appreciation Day over the weekend, therefore it only seems appropriate to ask the question: Will AI rule the world? On that note, are you a Skynet-esque end-of-the-world type guy James?
Nope. I’m a “tiny broken robots running into walls” adherent, myself. For something like Skynet to work, it would require way too much flawless execution in pursuit of a single goal. We humans just can’t do that.
So there were lots of snippets of news out over the last few days, highlighting the growing stream of AI-enabled use cases:
From analyticsinsight.net asking How AI is Changing the World, pointing to developments in AI-powered language learning apps, automated content generation through natural-language systems – like OpenAI’s GPT-3 deep learning tool, which I’ll revisit momentarily.
There was also an awesome article that dissected AI’s impact on astronomy, particularly its ability to help search for exoplanets. I encourage everyone tuning in to go and read this write-up, which also references research detailed in Nature Astronomy explaining how ML-based models are actually being used to help astronomers redefine the underpinning techniques and mathematical approaches used to search for exoplanets in the first place.
Similarly, while I was down-the-rabbit-hole, new research by Dr. Mike Walmsley from the University of Manchester and the Galaxy Zoo Collaboration outlines how humans and AI have teamed up to discover 40,000 ring galaxies in space.
And it’s all these latest advancements, James, that led to Forbes asking: Will Artificial Intelligence Rule the World? Author Mike O’Sullivan points to several troubling developments in AI. He cites work at the Swiss government’s Spiez Laboratory, which recently paired its research identifying deadly toxins and infectious diseases with an AI-driven drug discovery platform called MegaSyn. Unconstrained by the need to produce good outcomes, MegaSyn was left to run freely in an overnight experiment, in which it produced nearly 40,000 designs for potentially lethal, bioweapon-grade compounds.
O’Sullivan also points to the use of drones in Ukraine to programmatically recognize Russian military equipment and attack it. He contends that uses like these provide excellent examples of machines, unconstrained by morality, producing very negative outcomes.
So James: Are we all screwed? Is the cat out-of-the-bag? Is SkyNet coming to get us? And, regardless of your view on our collective doom, where’s that moral line for you when it comes to the evolution of AI?
Before the advent of quantum theory, classical physicists were convinced that our universe was static. Newton’s laws would always hold, and we only needed to grasp all of the variables at play to be able to make flawless predictions. Then along came things like quantum superposition, Lorenz attractors, and other pesky realities that ruined our ability to mathematically eliminate free will and predict all events everywhere.
I think the field of AI is a lot like that. From Hollywood, we get things like Skynet, an AGI or artificial general intelligence, that thinks and acts on its own. In reality, we have mere AI, which is programmed, however esoterically and obliquely, by humans to perform a specific task. If that task is the end of the human race, then I think you’d have to lay the blame at the feet of the people, not the machines.
I think the moral line for AI is the same as the moral line for surveillance, for weapons design and manufacture, and the like: don’t build tools whose sole purpose is to cause human suffering. More broadly (and I hope, more effectively) we need governments to step in and start setting regulation and legislation around this field, just as we did with the Nuclear Regulatory Commission for nuclear research in the mid ’70s.
If you’d like a down-to-earth view on the current state of AI development, check out Gary Marcus’ recent contribution to Scientific American. In his article – titled Artificial General Intelligence Is Not as Imminent as You Might Think – Marcus argues that the newest systems are still stymied by the same old problems.
He begins by sarcastically citing some gushing claims made by organizations and executives involved in OpenAI’s DALL-E 2 system for creating images from text, and its GPT-3 system for language interpretation and creation.
He also threw a fair degree of shade towards Alphabet’s DeepMind, pointing out that its much-touted Gato project still can’t accurately label an image of a baseball pitcher because, and I quote, “The system has no idea what is actually in the picture, as opposed to what is typical of roughly similar images”.
He points out that testing indicates that DALL-E 2 can’t tell the difference between a red cube on top of a blue cube, and a blue cube on top of a red cube.
Marcus summarizes his position by saying that: “Don’t be fooled. Machines may someday be as smart as people, and perhaps even smarter, but the game is far from over. There is still an immense amount of work to be done in making machines that truly can comprehend and reason about the world around them.”
James: Is this sassy take on the world of AI fair? And how far away do you think we are from unearthing Artificial General Intelligence – an AI that has the flexibility and resourcefulness of human intelligence?
Let’s put this in terms of mathematics. We are currently figuring out how to add up quantities but don’t yet have a concept of zero. You’re asking when we’ll invent Graham’s number. Answer: a long long time from now.
I’ll give you my personal take on things like GPT-3 and DALL-E. A lot of pundits (and people in my Twitter feed, which are not necessarily the same cohort) were out claiming that GPT-3 was the end of the professional author and that DALL-E spelled certain doom for artists everywhere. In reality, these are systems that can be used to produce low-grade works of fiction and art, but are more aptly viewed as tools to augment the natural ability of the human.
Is DALL-E a replacement for the full pipeline of art, from idea to execution? No. Is it a great prompt or exploratory sketch engine? Definitely. Sometimes it spits out some exciting, interesting, or even revolting imagery. In the hands of a true artist, a human artist, those prompts can become the next Magritte, Van Gogh, or Hieronymus Bosch.
Data science vs decision science
To round out today’s episode, I found another instructive article that offers a side-by-side comparison of different industry terms; Analytics Insight’s Top 5 Ways Data Science is Different From Decision Science.
The write-up argues that, because organizations often have both data scientists and decision scientists, there’s general confusion in the industry about who’s responsible for what. Check out the article for the blow-by-blow but, essentially, it argues that: Data Science views data as a tool for asking further questions or enhancing products and services; whereas Decision Science views data as a tool for smarter business decisions.
What do you think, James? Is it as simple as Data Science is the process of using data for analysis, whereas Decision Science is about making choices based on analysis produced?
Before I answer that, can I just ask why in the names of the gods is this article a PDF slide deck embedded in a web site and surrounded by ads?!? If the singularity comes and we as a race do go extinct, I think we deserve it for this particular offense.
I think it’s worthwhile to look at the age of these two fields. I trace data science as far back as John Tukey in the ’60s, when he proposed a reboot of the field of statistical analysis through exploratory and confirmatory data analysis. One of my favorite seminal works on data is his 1977 classic Exploratory Data Analysis, which teaches numeracy with pencil and paper. It’s also worth pointing out that Tukey invented the term “bit” in the late forties. But I digress.
Decision science seems to have sprung into existence in late 2019, with the publication of Lorien Pratt’s book Link: How Decision Intelligence Connects Data, Actions, and Outcomes for a Better World. I’d not heard of the field before then, at least. I’d love to be proven wrong; nothing is more fun than reading seminal works in academic fields.
Seriously. I love that stuff. I’m working on a blog post explaining E.F. Codd’s paper on relational algebra to modern data folks.
Anyway, if I had to sum up these two fields in a sound bite, I’d say: Both the decision scientist and the data scientist have an end goal, and use data to get there. The decision scientist’s goal is to provide context for a business person to make the correct decision, in the moment, tactically. The data scientist’s goal is to transform data into information, and information into knowledge, with an ultimate goal of understanding.
Until next time
Ok, we’re out of time for episode 4 of Data and DevOps Digest; brought to you by Vivanti Consulting. James, thanks for your time.
For those watching, thanks for tuning in. To receive regular Data and DevOps news and analysis, subscribe to our YouTube channel, or go to vivanti.com/contact-us and sign up for our mailing list.
It’s been a pleasure to have your company; bye for now!