Deriving a Safe Ethical Architecture for Intelligent Machines (ECAP ’10)

The biggest challenge now facing humanity is how to advance without being enslaved or rendered extinct by our own actions and creations or their unintended consequences. In particular, intelligent machines (IMs) appear to be a tremendous risk. As IMs become more intelligent and take on more responsibilities, their decisions must be informed and constrained by a coherent, integrated ethical structure with no internal inconsistencies for everyone’s safety and well-being.

Unfortunately, no such structure is currently agreed to exist; indeed, there is little agreement even on exactly what morality is. Human ethics are evidently implemented as emotional rules of thumb that are culture-dependent; demonstrably not accessible to conscious reasoning (Hauser et al., 2007); often not optimal in a given situation; and frequently not applied, due either to selfishness or to inappropriate us-vs.-them distinctions. It is the attempt to analyze such rules under varying, incomplete and inappropriate circumstances that has stymied philosophers for millennia and that blocked Wallach and Allen (2009), who discussed top-down and bottom-up approaches to morality, and merging the two, but could come to no conclusions.

What is necessary is a universal ethical system that follows the 5 S’s (Simple, Safe, Stable, Self-correcting and Sensitive to current human thinking, intuition, and feelings).
We propose to show a possible path by drawing upon experimental evidence and lessons learned from evolution and economics to describe a morality that is actually objective and derivable from first principles. We will then outline a method for deriving a coherent, integrated, platonic ethical system that flows naturally from a single Kantian imperative to the low-level rules of current human sensibilities.

Using a functional definition for morality makes it possible to explain why morality exists and why it is what it is, and it gives a yardstick for improvement. Function implies a “purpose” or a “goal” which determines the actions that “ought” to be taken and gives context for the value judgments of right or wrong. Trying to define right and wrong or good and bad (or evil) in the abstract or as platonic entities is a hopeless task. However, in the context of some task or goal, such evaluations can easily be made.

It is an interesting paradox that, despite the tremendous disparity in human goals, there clearly exists a reasonable consensus on the morality of the vast majority of actions. Steve Omohundro (2008) began explaining this when he used micro-economic theory and logic to identify six drives (“universal” sub-goals) which increase the probability of success for any non-conflicting goal: self-improvement, rationality, goal preservation, evaluation protection, self-protection, and the effective acquisition and use of resources. He believed, however, that a major problem with these drives is that “[w]ithout explicit goals to the contrary, [IMs] are likely to behave like human sociopaths in their pursuit of resources.”

Yet, the vast majority of human beings don’t behave this way. As pointed out by James Q. Wilson (1993), the real questions about human behaviors are not why we are so bad but “how and why most of us, most of the time, restrain our basic appetites for food, status, and sex within legal limits, and expect others to do the same.” He attributed the fact that we are generally good even in situations where social constraints do not apply to an evolved “moral sense” that we all possess and are constrained by (just as we wish IMs to be constrained). Experiments in game theory (Axelrod, 1984) clearly show that, while selfish and unethical behavior is logical when interaction is limited to a single occurrence, the situation changes dramatically when an open-ended series of interactions is considered. Reciprocation derives cooperation from selfish motives in the presence of long-term repeated interactions.
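The contrast between single and repeated interactions can be illustrated with a minimal sketch of the iterated prisoner's dilemma. The payoff values are the conventional ones from Axelrod's tournaments (5 for exploiting, 3 for mutual cooperation, 1 for mutual defection, 0 for being exploited); the function names and the choice of 100 rounds are assumptions made purely for illustration, not something taken from the paper.

```python
# Payoff to "me" for (my_move, their_move); "C" = cooperate, "D" = defect.
# Conventional values from Axelrod's tournaments.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the other player's previous move."""
    return other_history[-1] if other_history else "C"


def always_defect(own_history, other_history):
    """The purely selfish strategy: defect on every round."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Play a fixed number of rounds and return (score_a, score_b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(always_defect, tit_for_tat, 1))    # (5, 0)
    print(play(always_defect, tit_for_tat, 100))  # (104, 99)
    print(play(tit_for_tat, tit_for_tat, 100))    # (300, 300)
```

In a single round the defector comes out ahead (5 against 0). Over 100 rounds, however, the defector collects only 104 against a reciprocator, while two reciprocators collect 300 each, which is the sense in which reciprocation makes cooperation the selfishly better choice.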

Indeed, as Frans de Waal (2006) points out, any zoologist would classify humans as obligatorily gregarious since we “come from a long lineage of hierarchical animals for which life in groups is not an option but a survival strategy”. Humans have evolved to be extremely social because mass cooperation, in the form of community, is the best way to survive and thrive. While intelligence is the most obvious and frequently cited result of evolution, ants, bees and termites demonstrate that robust success is possible without much individual intelligence at all. We have previously argued (Waser 2008) that acting cooperatively (ethically) is an attractor in the state space of behavior, that humans have therefore evolved to be basically moral, and that shortsighted self-interest outweighing the sense of cooperation/community causes almost all of the serious difficulties now experienced by humanity.

What Omohundro overlooked was that there are five more universal sub-goals (drives) that increase the probability of goal success even more dramatically: cooperation, fairness, community protection, community building, and increasing freedom. Non-exploitable cooperation (via the game-theoretically optimal “optimistic tit-for-tat”) and community building dramatically increase the probability of help and economies of scale while reducing the odds of interference. And fairness (backed by altruistic punishment) protects the community by preventing defection.
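One way to see why a more forgiving form of reciprocity might be preferred is to look at what a single accidental defection does to two reciprocators. The abstract does not define “optimistic tit-for-tat”, so the sketch below uses the well-known tit-for-two-tats strategy as a hypothetical stand-in for a forgiving variant; that substitution, the helper `play_with_slip`, and the round counts are all assumptions for illustration only.

```python
# Payoff to "me" for (my_move, their_move); "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the other player's previous move."""
    return other_history[-1] if other_history else "C"


def tit_for_two_tats(own_history, other_history):
    """Forgiving stand-in (an assumption, not the paper's strategy):
    defect only after two consecutive defections by the other player."""
    return "D" if other_history[-2:] == ["D", "D"] else "C"


def play_with_slip(strategy, rounds, slip_round):
    """Two copies of `strategy` play each other; player A's move is forced
    to "D" once, at the 0-indexed round `slip_round`, to model a mistake."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for r in range(rounds):
        move_a = strategy(hist_a, hist_b)
        move_b = strategy(hist_b, hist_a)
        if r == slip_round:
            move_a = "D"  # the single accidental defection
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play_with_slip(tit_for_tat, 20, 2))       # (51, 51)
    print(play_with_slip(tit_for_two_tats, 20, 2))  # (62, 57)
```

With strict tit-for-tat, the one mistake triggers an endless echo of alternating retaliation and both players end on 51. The forgiving variant absorbs the mistake and both do better (62 and 57). Note that tit-for-two-tats is itself easier to exploit than tit-for-tat, so this shows only the benefit of forgiveness under error, not the non-exploitability the text claims for its own strategy.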

Thus, we arrive at the simple functional definition of morality as “that which maximizes the probability of cooperation”. This definition has several advantages. It explains why we (and our mind children) should be moral (self-interest), it provides a context in which to start answering ethical questions, and it doesn’t make any arguable assumptions like the existence of an absolute truth or an omnipotent being. Best of all, it is entirely in line with the latest expert opinions in social psychology (Haidt and Kesebir, 2010).

Realizing that our own ethical structure starts at the top with the evolutionarily-driven Kantian imperative of “Maximize long-term cooperation” and investigating how our moral intuitions have evolved from it allows the derivation of our desired structure from an existing example. Marc Hauser (2006) describes numerous methods for studying the details of human ethics by combining primate studies, infant studies, child development studies, brain scans, studies of unconscious actions, and cross-cultural studies of carefully designed novel situations (“trolley problems”). If we can determine what our ethical rules are (prior to their being overridden by our selfish consciousness), why we have those specific rules, and the phase-boundaries at which particular rules no longer rationally apply, then we will have the universal ethical structure described above.

Progress has been made on a variety of fronts. While designing an artificial economy for the purpose of evolving a program to solve externally posed problems, Eric Baum (2006) asked the question “What rules can be imposed so that each individual agent will be rewarded if and only if the performance of the system improves?” One answer he discovered – conservation of resources via property rights – explains many otherwise inexplicable results, such as why we don’t “steal” someone’s body to save five others despite not hesitating to switch a train from a track with five people to a siding with only one, and why the “Tragedy of the Commons” arises when property is held in common. Similarly, Darcet and Sornette (2006) used an evolutionary model with simple cost/benefit analyses at the level of single agents to quantitatively explain experimental results on the third-party punishment game, the ultimatum game and altruistic punishment games, and to confirm that the propensity to punish is a robust emergent property necessary to support cooperation (i.e. we don’t always want our machines to be nice).

An IM with conscious access to the derived universal ethical structure should be as interested in cooperation and in determining the optimal actions for cooperation as the most ethical human, if not more so. It will be as safe as possible; yet, it will also be perfectly free and, since it has been designed in a fashion that is optimal for its own well-being, it should always desire to be safe and to maintain or regain that status. What more could one ask for (except, perhaps, that we humans become equally ethical as well)?


Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.

Baum, E. (2006). What Is Thought? MIT Press.

Darcet, D., Sornette, D. (2006). Cooperation by Evolutionary Feedback Selection in Public Good Experiments. In Working Papers, Social Science Research Network. Available at

de Waal, F. (2006). Primates and Philosophers: How Morality Evolved. Princeton University Press.

Haidt, J., Kesebir, S. (2010). Morality. In S. Fiske, D. Gilbert, & G. Lindzey (eds.), Handbook of Social Psychology, 5th Edition. Wiley.

Hauser, M. (2006). Moral Minds: How Nature Designed Our Universal Sense of Right and Wrong. HarperCollins/Ecco.

Hauser, M. et al. (2007). A Dissociation Between Moral Judgments and Justifications. Mind & Language 22(1):1-27.

Omohundro, S. M. (2008). The Basic AI Drives. In P. Wang, B. Goertzel, & S. Franklin (eds.), Proceedings of the First AGI Conference. IOS Press.

Wallach, W., Allen, C. (2009). Moral Machines: Teaching Robots Right from Wrong. Oxford University Press.

Waser, M. (2008). Discovering The Foundations Of A Universal System Of Ethics As A Road To Safe Artificial Intelligence. In AAAI Technical Report FS-08-04. AAAI Press.

Wilson, J. (1993). The Moral Sense. Fre

