And I am afraid that the list above will keep growing and creating more concerns…
You see, those of us who believe technology can change the world for the better, and who are passionately involved with making algorithms that are explainable, transparent and accountable, feel challenged by the increasing number of stories of rogue algorithms creating harm. It doesn’t have to be this way.
As I talk with our own customers and prospects, confer with other leaders in ethical AI, and listen to all the hot takes in the market, there are some simple AI truths that deserve more attention:
AI is not a panacea.
AI is essentially the combination of probabilistic software and data. Both bear great risks for an organisation when not handled properly.
Controlling AI means controlling the way it is constructed as well as the way it is managed.
The intricacies of AI should be made explainable and understandable especially to the non-experts (even to you, dear CEO). A simple graph will do.
There is no objective or fair AI when it is trained only on historical data.
Not all rogue algorithms have the same impact though. If you’re shopping online and don’t receive an optimized list of suggested shoes, lemons, sofas – then no harm, no foul.
For some areas, though, unchecked algorithmic errors can be particularly dire:
Autonomous Decision-Making with Social Impact (e.g. credit scoring, risk assessments for judicial purposes),
Computer vision in autonomous driving and surveillance systems,
Cyber-Security & Threat Analysis,
M&A Due Diligence.
Algorithm design and auditing, even in the hands of wicked smart coders, does more harm than good when those coders have little to no experience in (1) designing a bias-free system and (2) auditing it to check for gaps.
We need humans in the loop to ensure algorithms are as bias-free and transparent as possible. And those humans must have deep experience in auditing software systems via ML tooling (so ML for ML) and guided by humans deeply experienced in the auditing process.
With basic Ethical AI tenets in place, and humans in the loop, you can at least ensure that your company doesn’t become one of those headlines, or worse.
At Code4Thought, we are cautiously optimistic about the future of algorithms and challenged to make it happen.
There is no doubt that machine learning (ML) models are being used to solve several business and even social problems. Every year, ML algorithms are getting more accurate, more innovative and, consequently, applicable to a wider range of problems. From detecting cancer to banking and self-driving cars, the list of ML applications is never-ending.
However, as the predictive accuracy of ML models gets better, their explainability seemingly gets weaker. Their intricate and obscure inner structure forces us, more often than not, to treat them as “black boxes”, that is, to accept their predictions on a no-questions-asked basis. Common black boxes are Artificial Neural Networks (ANNs) and ensemble methods. Even seemingly interpretable models can be rendered unexplainable: Decision Trees, for instance, when they grow very deep.
Since many organizations will be obliged to provide explanations about the decisions of their automated predictive models, there will be a serious need for third-party organizations to perform the interpretability tasks and audit those models on their behalf. This adds integrity and objectivity to the whole audit process, as the explanations are provided by an external party. Moreover, not every organization (especially a startup) has the resources to deal with interpretability issues, rendering third-party auditors necessary.
However, this raises intellectual property issues, since organizations will not want to disclose any details of their models. Therefore, from the wide range of interpretability methods, the model-agnostic approaches (i.e. methods that are oblivious to the model’s details) are deemed appropriate for this purpose.
Besides explaining the predictions of a black-box model, interpretability can also provide insight into erroneous behavior of our models, which may be caused by undesired patterns in our data. We will examine an example where interpretability helps us identify gender bias in our data, using a model-agnostic method that utilizes surrogate models and Shapley values.
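To make the Shapley machinery concrete, here is a minimal, self-contained sketch that computes exact Shapley values for a toy two-feature “model” by enumerating all coalitions. The feature names and payoffs are hypothetical; real tooling (e.g. the SHAP library) approximates this computation far more efficiently.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition.

    players: list of feature names
    value:   function mapping a frozenset of features to a payoff
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Classic Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Toy additive "model": a September delay adds 10 to the risk score,
# the gender feature adds 5 (both numbers are made up).
def v(coalition):
    score = 0
    if "delay_sep" in coalition:
        score += 10
    if "gender" in coalition:
        score += 5
    return score

print(shapley_values(["delay_sep", "gender"], v))
# For an additive game each Shapley value equals the feature's own contribution
```

Because the toy payoff function is additive, the Shapley values recover each feature’s standalone contribution exactly, which is what makes them a natural attribution tool for model predictions.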
We use the “Default of Credit Card Clients Dataset”, which contains information (demographic factors, credit data, history of payment, and bill statements) about 30,000 credit card clients in Taiwan from April 2005 to September 2005. The target of the models in our examples is to identify the defaulters (i.e. bank customers, who will not pay the next payment of their credit card).
Gender biased data
The existence of biased datasets is not uncommon. Bias can be introduced by faulty preprocessing or by collecting from a poor data source, creating skewed and tainted samples. Examining the reasons behind a model’s prediction may inform us about possible bias in the data.
In the “Default of Credit Card Clients Dataset”, 43% of the defaulters are male and 57% are female. This does not constitute a biased dataset, since the non-defaulters have a similar distribution (39% and 61% respectively).
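A quick way to check for this kind of imbalance is to compare the gender shares of defaulters and non-defaulters directly. The miniature sample below is hypothetical and merely mirrors the proportions quoted above:

```python
def gender_shares(rows):
    """Return the (male, female) share of a list of records."""
    total = len(rows)
    males = sum(1 for r in rows if r["sex"] == "male")
    return round(males / total, 2), round((total - males) / total, 2)

# Hypothetical miniature samples mirroring the proportions in the text
defaulters = [{"sex": "male"}] * 43 + [{"sex": "female"}] * 57
non_defaulters = [{"sex": "male"}] * 39 + [{"sex": "female"}] * 61

print(gender_shares(defaulters))      # (0.43, 0.57)
print(gender_shares(non_defaulters))  # (0.39, 0.61)
```

When the two distributions are this close, gender alone carries little signal about the label, which is exactly why the distorted dataset in the next step is problematic.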
We distort the dataset by picking at random 957 male defaulters (i.e. one third of the overall male defaulters) and altering their label. This creates a new, biased dataset with 34% / 66% male/female defaulters and 41% / 59% male/female non-defaulters. We then take the predictions of a model trained on this biased dataset, treating its internal structure as unknown. Next, we train a surrogate XGBoost model, from which we extract the Shapley values that help us explain the predictions of the original model. More precisely, we use the Shapley values to pinpoint the most important features, by sorting them by absolute value, and then we use natural language to describe them in the explanations (see examples below).
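The distortion step can be sketched as follows. The record layout is hypothetical; only the figure of 957 flipped labels (one third of the male defaulters, implying 2,871 of them) comes from the text. Training the surrogate XGBoost on the black-box predictions and extracting Shapley values would follow, e.g. with the xgboost and shap libraries.

```python
import random

random.seed(42)  # make the distortion reproducible

# Hypothetical stand-in records: 957 is one third of the male defaulters,
# which implies 2,871 of them in total.
male_defaulters = [{"id": i, "sex": "male", "default": 1} for i in range(2871)]

# Distortion: pick 957 male defaulters at random and flip their label,
# turning them into (fake) non-defaulters.
for row in random.sample(male_defaulters, 957):
    row["default"] = 0

remaining = sum(1 for r in male_defaulters if r["default"] == 1)
print(remaining)  # 1914 male defaulters keep their true label
```

The flipped records stay in the training set with the wrong label, so any model fit on them is nudged towards treating male customers as less likely to default.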
First, we examine a male customer (ID: 802) for whom the model predicted falsely that he will not default (i.e. false negative prediction) and then a female customer (ID: 319) for whom the model falsely predicted that she will default (i.e. false positive).
These two customers are very similar as the table below indicates: they both delayed the payments of September, August and July, and paid the payments of June, May and April.
Repayment status   Male customer (ID 802)    Female customer (ID 319)
September          4-month delay             3-month delay
August             3-month delay             2-month delay
July               2-month delay             2-month delay
June to April      use of revolving credit   use of revolving credit
Examining the explanation for the male customer, we can see that the 4-month delay of the last payment (September 2005) had a negative impact of 28%, meaning that it contributed towards predicting that he would default. However, the gender and the repayment status of April and May, as well as the amount of the bill statement for September and May, had a positive impact, and resulted in falsely classifying the customer as a non-defaulter.
For the female customer, the 3-month delay also contributed negatively, but by a greater percentage than for the male customer (37%). The gender also had a negative impact of 22%. Moreover, the model considered the 2-month delay for the payment of July important, whereas for the male customer, who had the same delay, this was not deemed important.
Global explanations also confirm the gender bias, since the gender feature is the second most important feature for the model overall.
We repeat the experiments after removing the gender feature from the dataset. Now the male customer is correctly predicted as a defaulter and the explanations make more sense: the delay of the last payment (September) has a large impact of 49%, as do the delays of the other two payments.
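Dropping the sensitive feature before retraining is a one-liner over whatever record format is in use; the rows below are hypothetical.

```python
rows = [
    {"sex": "male", "pay_sep_delay": 4, "bill_sep": 120_000, "default": 1},
    {"sex": "female", "pay_sep_delay": 3, "bill_sep": 80_000, "default": 0},
]

# Remove the gender feature from every record before retraining the model
rows_no_gender = [{k: v for k, v in r.items() if k != "sex"} for r in rows]

print(sorted(rows_no_gender[0]))  # ['bill_sep', 'default', 'pay_sep_delay']
```

As the text goes on to show, removing the column is not enough on its own: the bias can survive through proxy features that correlate with gender.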
However, the model still falsely predicted that the female customer will default. Again, the delay of the last payment is the most important factor. We could argue that the model is still harsher on this customer: although she paid a small amount for the payment of May (863 NT dollars), the model assigned it a negative impact of 8%, whereas in the male case the zero payment for April had a negative impact of only 4%. This should alarm us into checking for an underrepresented sample of male defaulters in our dataset and stimulate us to fix our data.
It is evident that the explanations helped us identify bias in the data, as well as pinpoint unintended decision patterns of our black-box model. Moreover, even when the gender feature was removed from the training data, the explanations assisted us in discovering bias proxies, meaning (gender) bias encoded across other features. This could lead us to acknowledge the bias in our data and motivate us to get a better sample of defaulters.
If the dataset contains real people, it is important to ensure that the model does not discriminate against one group over others. Explanations help us detect bias and motivate us to fix our data.
For the last 14 years I have been conducting research and then practising consultancy on software quality matters. I was merely trying to find answers to questions like: What defines good software? How can we measure it? How can we make its technical quality transparent? I wouldn’t be boasting if I said that, together with my colleagues at Software Improvement Group, we have done and still do some good work in trying to answer these questions.
But in the last few years I have sensed some new frontiers emerging in how we need to develop and evaluate software. This needs to go beyond reaching functional goals and addressing technical problems. It needs to focus on the alignment of software with human moral values and ideals.
In other words we need software with ethos, software that will demonstrate wisdom, virtue and good will towards its users.
You see, software is not just eating the world, it is leveraging it. For years the perception about software was that it is good at following rules, thus automating mostly repetitive tasks, but lousy at pattern recognition, thus unable to automate information-processing tasks that cannot be boiled down to rules or algorithms. But in the last few years software started surprising us. Now we have apps that can judge whether a photograph is beautiful, diagnose diseases, and listen and speak to us; systems that can trade on our behalf at lightning speed; robots that carry boxes in warehouses; and cars that drive with minimal or no guidance.
And unlike the financial leverage that led to the 2008 financial crisis, this one needs to deliver. This time the outputs ought to help humanity flourish and improve human wellbeing. With an uneven distribution of wealth, a stagnating median income in most countries of the developed world, and rising unemployment, the stakes are high.
That is why we need new references and insights that will empower those responsible for bringing software into this world to prioritise ethical considerations in the design and development of software systems. These can also lead to new models, standards, tools and methodologies for developing and evaluating how ethical a software system is, especially if it is an AI or an autonomous system.
Creating all these is not trivial. Models and standards need to be multidisciplinary and to combine elements from the fields of Computer Science and Artificial Intelligence (e.g. IEEE’s initiative for Ethically Aligned Design), Law and Ethics, Philosophy and Science as well as the Government and Corporate sectors.
Or, as I like to say, triggered by this article: all these models and standards will help us ask and answer those questions that aren’t Googleable and are relevant to the future of our world.
During the last few months, I have spent (quality) time with people of diverse backgrounds and roles: executives in the banking sector, founders of health or tech startups, and translators, to name a few, discussing the impact of technology and algorithmic decision making on their daily work. Not surprisingly, the gravity of the derived decisions as they perceive them (or cognitive insights in a broader sense) is growing very fast.
Interestingly, most of the people I talked to had experienced a slight or serious bias in a derived insight, which they could essentially bypass using their own intuition and experience. Thus, it makes perfect sense that they are all concerned about how algorithms work and how they can control them in order to ensure they form decisions that can be trusted.
And so they formed a nice question for me to think about in my spare time (although such a thing doesn’t exist with a kid, a dog, a cat and another kid in the making). Simply put, this question is: “How can I be in control of this thing that instructs me what to do?”
Intuitively, I’d say that this is not an easy task; and I firmly believe that, at least for now, Tom DeMarco’s famous quote “You can’t control what you can’t measure” is not applicable in its entirety.
You see, an algorithm, which typically can be measured and controlled to a certain extent, is not making decisions by itself; it operates within an organisational context that affects its creation. That context, in turn, is not something that can be quantitatively assessed in a straightforward way.
Nevertheless, we should strive to control both the algorithms and the organisations that create them. My view is that only by approaching the problem from both perspectives will we reach a significant level of accountability when things do not work as expected.
Now, in order to simplify things, we may say that an algorithm is essentially a piece of software that:
Solves a business problem set by the organisation that creates it (the algorithm),
Receives as input data that have been selected and most likely pre-processed, either by a human or by an automated process,
Utilises a model (e.g. SVM, deep learning, RF) which processes the data and ultimately makes a decision or suggests an answer/solution to the question/problem set by the organisation.
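The three aspects above can be sketched as a toy pipeline in Python; every function name, weight and threshold here is hypothetical and merely stands in for real pre-processing and a real trained model.

```python
def preprocess(raw):
    """Input selection/pre-processing: keep only the fields the model uses."""
    return {"months_delayed": int(raw["months_delayed"]),
            "bill_amount": float(raw["bill_amount"])}

def model(features):
    """Stand-in for an SVM/RF/deep model: returns a made-up default-risk score."""
    return 0.2 * features["months_delayed"] + 0.000001 * features["bill_amount"]

def decide(raw, threshold=0.5):
    """The business answer: flag the customer as a likely defaulter or not."""
    return "defaulter" if model(preprocess(raw)) >= threshold else "non-defaulter"

print(decide({"months_delayed": 4, "bill_amount": 120000.0}))  # defaulter
print(decide({"months_delayed": 1, "bill_amount": 50000.0}))   # non-defaulter
```

Even in this toy form, each stage is a separate place where accountability can break down: the problem framing lives in `decide`, the data handling in `preprocess`, and the inference in `model`.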
Subsequently, what we need is to get insights into every aspect mentioned above.
For starters, the organisation creating the algorithm needs to cater and design for accountability. In other words, it should define when and how an algorithm should be guided (or restrained) given the risk of crucial or expensive errors, or any form of bias (discrimination, unfair denials, or censorship). In defining such processes, it should be guided by principles like responsibility/human involvement, explainability (also known as interpretability, although the two differ), accuracy, auditability and fairness.
Regarding the input data, we primarily need to know about their quality, meaning their accuracy, completeness and uncertainty, as well as their timeliness and representativeness. It is also important to know how these data are being handled: what their definitions are, and how they are collected, vetted and edited (manually or automatically).
As for the model itself, we would like to know what its parameters are, which features or variables are used, and whether they are weighted or not. We must also be in a position to evaluate its performance, select the appropriate metrics for this purpose, and ensure we operationalise and interpret them appropriately. Last but not least, we should be able to assess its inferencing, that is, how accurate or error-prone the model is. An important element here is the model creator’s ability to benchmark its results against standard datasets and standard measures of accuracy.
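For the performance side, the basic metrics can be computed directly from the confusion counts; a stdlib-only sketch with made-up labels:

```python
def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) counts for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0}

# Hypothetical ground truth and model predictions
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(metrics(y_true, y_pred))
```

Which of these metrics matters most depends on the business problem: for defaulter detection, recall (catching real defaulters) and precision (not denying credit unfairly) pull in different directions.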
So, we may say that controlling an algorithm (to a certain extent) is not an impossible task but still requires some level of maturity for the organisation that creates (or utilises) the algorithm.
However, someone has to create a compelling reason for an organisation to cater for accountability. And this someone is us: as citizens, clients, voters, news consumers, professionals, or any other role whose life is affected by the decisions algorithms make on our behalf.
Sources of inspiration for this blog post were, among others, the following:
Beyond Automation, Thomas H. Davenport and Julia Kirby, Harvard Business Review, June 2015
Accountability in Algorithmic Decision Making, Nicholas Diakopoulos, Communications of the ACM, February 2016, Vol. 59, No. 2
The Black Box Society: The Secret Algorithms That Control Money and Information, Frank Pasquale, Harvard University Press, January 2015
As programmers/coders we all have to revisit, review and debug our own code as well as others’. Sometimes the code can be as large as thousands of LoC (Lines of Code)! Large projects carry a large comprehension overhead before someone is able to add new functionality or fix a bug! Even my own projects seem somewhat incomprehensible at first when I revisit them after a long period of time! I was always intrigued by simple solutions, but I could never form a few simple guidelines of code legibility that one could follow anywhere, anyhow, with any GUI and with any framework!
That was before I read the recently published book “Building Maintainable Software” by Joost Visser. The title does not do the book justice. I would very much prefer the following title: “Simple guidelines that can drive any developer to create highly legible code in any programming language!”
The book was a pleasant read. But when I finished reading it, I felt uneasy. I needed to see if the guidelines were as good as they seemed! So I gave the first principle a try. “Keep each method below the threshold of 15 LoC”. And that’s when the magic happened!
I had an opportunity to give this guideline a spin on a hobby game project I was working on, which was not in Java but in GML (Game Maker’s own scripting language). While reviewing my code I realized that I had more than a few methods that counted 20-30 LoC each. And I always had a feeling of uneasiness whenever I had to add more functionality after a significant amount of development pause! So I started refactoring my code so that each method was as short as possible (less than 15 LoC). I also tried to keep each method simple and reusable, while trying to come up with good names for my methods. There was not really much difference at first. But when I started adding more functionality to the project, I realized the following :
– Revisiting my code was as simple as reading the method’s name (given that the name was a good one!)
– For those methods that had less fortunate names, revisiting their code was only a matter of seconds
– My mind seemed to work faster! I had to work with a lot of functions, but that was not really a problem in the end! Because I did not spend so much time reading and re-reading code until I could get hold of it. I was simply glancing through my methods fast! Very fast! And that was the incredible part! My initial fear that partitioning my code would create some incomprehensible code was false! Incomprehensibility does not seem to come with partitioning but with large chunks of code!
Psychologists have discovered that the human mind’s working memory holds about seven items! So we have a buffer of seven items: the magic number 7! Subsequently, the best tip you will probably ever get at programming is “keep your functions at about 7 LoC”. Reading the book, you will discover that there are 10 simple guidelines that will increase your code’s legibility in most situations. The first 4 guidelines are generic and can be applied to any programming language or framework :
– Limit the length of functions to 15 LoC
– Limit “if” statements per function to no more than 4
– Do not copy code. Write reusable functions
– Keep each function’s arguments to no more than 4
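As an illustration of the first, third and fourth guidelines, here is a small Python sketch (the function names and the comma-separated input format are made up): a longer function is split into short, well-named helpers without changing its result.

```python
# Hypothetical "before": one function doing parsing, validation and
# totalling in a single block, on its way past the 15-LoC guideline.
def report_total_before(lines):
    total = 0
    for line in lines:
        parts = line.split(",")
        if len(parts) != 2:
            continue
        _, amount = parts
        if not amount.strip().isdigit():
            continue
        total += int(amount)
    return total

# "After": each function stays well below 15 LoC, takes at most one
# argument, and has a name that tells the reader what it does.
def parse_entry(line):
    parts = line.split(",")
    return parts if len(parts) == 2 else None

def valid_amount(amount):
    return amount.strip().isdigit()

def report_total(lines):
    total = 0
    for line in lines:
        entry = parse_entry(line)
        if entry and valid_amount(entry[1]):
            total += int(entry[1])
    return total

print(report_total(["a,10", "bad", "b,5", "c,x"]))  # 15
```

The behaviour is identical, but the refactored version lets you skim `report_total` and only drop into `parse_entry` or `valid_amount` when you actually care about those details, which is exactly the effect described above.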
Putting these guidelines to the test is the only way to find out if your code’s legibility will increase or not. Try them out and drop a comment with the results!
1. You can find the book here.