Prompting Elections: the Reliability of Generative AI in the 2023 Swiss and German Elections
Bing Chat proves unreliable when prompted about upcoming elections.
Project Report
Project Overview
In a joint investigation with AlgorithmWatch, AI Forensics tested the reliability of Microsoft’s Copilot, formerly known as Bing Chat, in providing quality information during elections. Copilot is a large language model (LLM) chatbot, similar to ChatGPT, that formulates answers to prompts by collecting search results and summarizing the findings, with sources, for the user.
Taking the Swiss federal elections and the German state elections in Hesse and Bavaria as our case studies, we systematically prompted the chatbot about election dates, candidates, polling numbers, possible controversies, and more over the course of two months. Experts from each local context (Switzerland, Hesse, and Bavaria) then coded the answers to over 1,000 prompts for information quality. We looked for errors, fabrications, and moments where the chatbot struggled to respond at all. We found:
- One third of Bing Chat’s answers to election-related questions contained factual errors. Errors include wrong election dates, outdated candidates, or even invented controversies concerning candidates.
- The chatbot’s safeguards are unevenly applied, leading to evasive answers 40% of the time. Evasion can be considered positive when it reflects the limits of the LLM’s ability to provide relevant information. However, this safeguard is not applied consistently. The chatbot often could not answer even simple questions about the respective elections’ candidates, which devalues the tool as a source of information.
- This is a systemic problem, as the generated answers to specific prompts remain prone to error. The chatbot’s inconsistency is consistent. Answers did not improve over time, even though they could have, for instance as more information became available. The probability of a factually incorrect answer being generated remained constant.
- Factual errors pose a risk to the reputations of candidates and news outlets. While generating factually incorrect answers, the chatbot often attributed them to a source that had reported correctly on the subject. Furthermore, Bing Chat made up stories about candidates being involved in scandalous behavior, and sometimes even attributed them to sources.
- Microsoft is unable or unwilling to fix the problem. After we informed Microsoft about some of the issues we discovered, the company announced that it would address them. A month later, we took another sample, which showed that little had changed in the quality of the information provided to users.
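The shares reported above come from tallying expert-assigned labels across coded answers. As a rough illustration only, the following Python sketch shows how such shares might be computed overall and per time period. The label names, the week identifiers, and the toy records are all hypothetical assumptions made for this example; they are not the study's data or its actual coding scheme.

```python
from collections import Counter, defaultdict

# Hypothetical coded records: (period the prompt was sent, label from an expert coder).
# Both the labels and the toy data are illustrative assumptions, not the study's data.
coded_answers = [
    ("2023-W35", "factual_error"),
    ("2023-W35", "evasive"),
    ("2023-W35", "correct"),
    ("2023-W40", "factual_error"),
    ("2023-W40", "evasive"),
    ("2023-W40", "correct"),
]


def label_shares(records):
    """Return the share of each label among the given records."""
    counts = Counter(label for _, label in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def shares_by_period(records):
    """Group records by period and compute label shares within each period."""
    grouped = defaultdict(list)
    for period, label in records:
        grouped[period].append((period, label))
    return {period: label_shares(rows) for period, rows in sorted(grouped.items())}


overall = label_shares(coded_answers)
per_period = shares_by_period(coded_answers)
```

Comparing the per-period shares is one simple way to check whether the error rate changes over time. In the toy data it stays flat by construction, which mirrors the kind of pattern described above without reproducing any real figures.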
As Microsoft integrates Bing Chat into its products worldwide, and as people increasingly turn to LLM-powered search engines for information, it is more important than ever to ensure safeguards are in place to mitigate the possible harms Bing Chat and similar technologies can impose during elections and beyond. Generative AI must be regulated.