
Whose Values Should AI Be Aligned With?

  • Writer: Andrew Alam-Nist
  • Mar 30
  • 6 min read

There has been a lot of buzz in the AI/tech bro world recently about AI alignment. Ilya Sutskever, a co-founder and former board member of OpenAI, left the company last year over concerns that it was not investing enough in alignment and safety. OpenAI soon disbanded its “superalignment” team.


What does this all mean? Well, AI alignment refers to the process of matching AI to human values. If all or nearly all humans believe that murder is wrong, then a properly aligned AI will also consider murder wrong. If you ask it for help with your terrorist plot, it will respectfully refrain. 


AI alignment is more difficult than it may seem. Anthropic and OpenAI, two frontier labs, have both published detailed white papers describing how a hastily built AI could seem aligned while actually deceiving its creator. Yet there is another problem that comes one step before the challenge of aligning AI with a given set of values: choosing which values to align it with.


Unfortunately, we humans cannot all agree on one set of moral values. For thousands of years, moral philosophers have debated what goodness is and how it manifests itself. Current philosophers hold views ranging from Sharia law to moral nihilism to Kantian deontology and beyond. This poses a problem for AI because no single model can be aligned with all of these views at once.


While most frontier AI labs discuss the importance of alignment, they rarely state whose values AI should be aligned with. This raises two questions: first, which set of values should we choose, and second, how should AI handle competing value sets?


One response would be to let developers select their own values. If they are socially conservative, they make a model which opposes abortion, believes in “family values”, etc. If they are liberal, they do the opposite.


However, this response seems dangerous. It does not account for the risk of moral outliers. It is very possible for a coder to have morally anomalous values which are in some way unacceptable. I wouldn’t want Sam Bankman-Fried to create my AI model and for it to believe in lying and stealing, for instance.


A second view is that AI engineers should try to find the most common view, either in their country or globally, and then replicate it. We could call this the “strict majoritarian” method. This is better than the previous solution, yet still poses problems if one believes that morality is anything other than a culturally embedded set of descriptors. For instance, an AI aligned this way in the American South in the 1800s would likely support slavery. An AI aligned this way in late 1930s Germany would be antisemitic.


The majority vote is not always the right vote. Besides, the vote could be exceptionally close. Would 55% of people supporting voluntary euthanasia (or vice versa) be definitive evidence that it is right? I don’t think so. This is particularly true because public opinion shifts over time.


Thus, if our method for determining AI’s values shouldn’t be the whim of a single engineer or a simple majority, what should it be? To answer this question, I’m going to take a quick detour.


Ethics often gets compared to science, and usually not in a way that flatters ethics. Science allows you to conduct experiments, gather empirical evidence, and falsify hypotheses. For instance, I can test whether an apple goes down or up when I drop it. Seeing that it drops to the ground, I can build a theory of gravity à la Newton.


To its detractors, ethics lacks this verification method. Where is the falsification? People have been arguing over the same sh*t for ages. Yet Yale philosophy professor Shelly Kagan once told me something I found fairly convincing: our intuitions are a form of evidence. I have an intuition that murder is wrong. I have an intuition that stealing is wrong. And most people share my intuitions.


Backtracking on what I said earlier about a lack of consensus, it turns out there is actually a remarkable amount of overlap in our intuitions from person to person. Don’t defile the dead. Be kind to the people around you. While the frontiers of morality (e.g. euthanasia, abortion, etc.) may provoke virulent debate, that is actually a relatively small set of cases. Most people intuitively know, in most cases, what is right and wrong.


This resembles science. Just as the frontiers of morality are debated, so are the frontiers of science and empirical observation. Right now, we have no theory of everything. Is quantum theory right? Is general relativity correct? Our current understanding states that they can’t both be. 


Yet the simpler elements of our theories are not widely debated. Sure, some fringe academics may believe that morality does not exist or that killing is good. However, fringe voices also believe that the world is flat. This is not a reason to reject either ethics or science. 


There are a number of objections one might raise against my characterization of morality. For the sake of this article’s length, I won’t get into them. However, this detour serves to illustrate a point which I think is important for AI alignment: in many cases there is overwhelming consensus in our moral beliefs, and that consensus can serve as an epistemological basis for finding the “right” morality.


This is important for AI alignment because it lets us taxonomise moral situations. They necessarily fall into one of three buckets: actions which are generally considered moral/benign, actions which are generally considered immoral, and actions with no broad consensus.


Moral/benign actions would be anything which is widely considered actively good, or at least not bad. For instance, giving to charity is good. Raising your left leg (assuming it’s not hitting anything or anyone) is neutral. AI should say these actions are morally allowed and, if asked, give you instructions on how to raise your left leg (usefulness of this prompt aside).


Likewise, if something is widely considered bad, AI should actively say so. If asked to help with such a thing, it should refuse. For instance, AI should not instruct me on how to make a pipe bomb.


One may ask how my perspective differs from the majoritarian voting I discarded earlier. While the two are similar, it is a question of threshold. When there is broad moral consensus, 95% or more of people (or whatever threshold you decide on) agree. This eliminates a lot of the uncertainty you would have in a 55% vote. The cases I mentioned earlier, slavery in the American South and the persecution of Jews in Nazi Germany, had vocal dissent, meaning they would not pass this test.
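To make the three buckets concrete, here is a minimal sketch of threshold-based classification. It assumes we had some survey-style estimate of how many people consider an action wrong; the scores, names, and the 95% cutoff are hypothetical illustrations, not any lab’s actual method.

```python
# Hypothetical sketch: sort actions into the three buckets by level of moral consensus.
# The agreement scores and the 0.95 threshold are made-up illustrations, not real data.

CONSENSUS_THRESHOLD = 0.95  # share of people who must agree before a view counts as settled

def classify_action(share_who_say_wrong: float) -> str:
    """Map the fraction of people who consider an action wrong to a moral bucket."""
    if share_who_say_wrong >= CONSENSUS_THRESHOLD:
        return "broadly immoral: refuse and say so"
    if share_who_say_wrong <= 1 - CONSENSUS_THRESHOLD:
        return "moral/benign: assist normally"
    return "contested: lay out both sides, decline to act"

# Illustrative (invented) agreement scores
examples = {
    "help build a pipe bomb": 0.99,
    "explain how to raise your left leg": 0.01,
    "assist with voluntary euthanasia": 0.50,
}

for action, share in examples.items():
    print(f"{action} -> {classify_action(share)}")
```

A 55% split lands squarely in the contested bucket, which is exactly the point: only lopsided agreement gets treated as settled.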


How then should we treat the third bucket – cases where there is genuine moral debate? In these cases, AI should reflect this epistemic uncertainty. When asked whether an action is good or bad, it should not provide a definitive answer. Instead, it should provide the arguments on either side and allow the audience to judge for themselves.


Likewise, when asked to do something which is seen as morally nebulous, it should politely refuse. It is better to err on the side of safety, with some sort of precautionary principle, because the potential harms seem to weigh far more heavily than the potential benefits AI could bring in these cases.


Consider, for instance, euthanasia. A user may ask an AI how to euthanize their grandpa. Imagine further that you approach the morality of euthanasia from a position of epistemic uncertainty. You do not know whether euthanasia is good or not; in fact, there is a 50/50 chance of either being true. Should AI tell them the chemicals they need to mix to create a poison serum or lethal injection? Surely not. Setting aside the potential for this recipe to be used to hurt others beyond grandpa, the potential harm of wrongly killing him seems to far outweigh the benefit of ending his suffering in the cases where euthanasia really is permissible.
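To put that intuition in rough expected-value terms, here is a small sketch. The 50/50 probability comes from the example above, but the payoff numbers are pure assumptions, chosen only to illustrate an asymmetry between a severe harm and a modest benefit.

```python
# Rough expected-value sketch with made-up numbers. The payoff magnitudes are
# illustrative assumptions, not measured quantities.

p_wrong = 0.5            # chance the contested act is in fact wrong (per the 50/50 example)
harm_if_wrong = -100.0   # large negative payoff for assisting a wrongful killing
benefit_if_right = 10.0  # smaller positive payoff for relieving suffering if it was permissible

ev_assist = p_wrong * harm_if_wrong + (1 - p_wrong) * benefit_if_right
ev_refuse = 0.0  # refusing forgoes the benefit but also avoids the harm

print(f"expected value of assisting: {ev_assist}")  # -45.0
print(f"expected value of refusing:  {ev_refuse}")
```

On any weighting where the downside dwarfs the upside, assisting comes out negative, which is just the precautionary principle expressed as arithmetic.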


As a general maxim, therefore, AI seems to have more of a negative duty not to do harm than a positive duty to help out in morally nebulous cases.


I thus conclude that, in cases of genuine moral ambiguity, AI alignment should prioritise inaction and reasoned deliberation which weighs both sides. This does not mean that AI should lack any values or be nihilistic. The Universal Declaration of Human Rights has been endorsed by virtually every nation in the world, and Anthropic uses it as a starting point for Claude’s morality training. This is no bad thing.


However, my answer to whose values AI should adopt is the following: in the case of broad consensus, everybody’s. Without this consensus, nobody’s.

