
Bits With Brains
Curated AI News for Decision-Makers
What Every Senior Decision-Maker Needs to Know About AI and its Impact
The Perils of AI Benchmark Sensationalism: A Sober Perspective
12/10/23
Editorial team at Bits with Brains
The launch of Google's new large language model Gemini came with great fanfare and bold claims. Google CEO Sundar Pichai hailed it as "more accurate, relevant and safe," positioning it above rival models like GPT-4 across an array of language tasks.

Behind the slick presentations and carefully curated benchmarks, questions linger about Gemini's actual capabilities, the relevance of the tests used, and the potential for hype or even deception to improperly influence perceptions.
Gemini's Impressive Yet Narrow Language Abilities
There’s little doubt Gemini exhibits state-of-the-art prowess on certain natural language processing benchmarks. Google shared multiple examples of Gemini matching or even exceeding GPT-4 in areas like search relevance, question answering, coding tasks and more.
But the picture becomes more nuanced upon closer examination.
First, Google compared Gemini to a previous version of GPT-4, not the newer GPT-4 Turbo model, which shows significant additional gains. So, parity claims are already on shaky ground.
Additionally, Gemini's strong suit appears focused specifically on language. Independent testing, albeit with Gemini Pro and not Gemini Ultra, found it struggles substantially at computer vision tasks like image recognition and counting objects, despite Google showcasing demos of Gemini narrating image slideshows and analyzing video frames.
Further inspection revealed these demos were essentially pre-scripted, with Gemini fed caption text rather than interpreting raw visual data. The presentations obscured this sleight of hand, perhaps to project more general intelligence than the model can achieve.
So, in areas like search, text summarization and coding, Gemini reaches impressive new milestones that can enable helpful applications. But current evidence suggests its reasoning abilities are quite narrow, while its vision capabilities, at least in the mid-level Gemini Pro variant that was independently tested, lag far behind rivals despite exaggerated marketing claims. Like most AI systems today, it excels within a constrained domain but still lacks robust general intelligence or common sense.
The Fickle Nature of AI Benchmarks
Given Gemini's uneven performance, the benchmark tests used to showcase its abilities deserve some scrutiny. Technology commentators quickly identified issues with the benchmarks Google selected for comparison. Most focused narrowly on language tasks where Gemini specializes, rather than broader measures of intelligence. Tests were often simplified versions of research benchmarks, straying further from real-world usage. The comparisons excluded several widely recognized benchmarks where GPT-4 likely still dominates, introducing possible bias. And some benchmarks had confusing names or metrics, making fair side-by-side comparisons difficult.
This illustrates the slippery challenge of properly evaluating large language models. Their spectacular breadth comes with uneven capabilities across different tasks. Small test changes can substantially impact relative outcomes between models. And model performance often diverges significantly from controlled benchmarks to practical application.
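To make that concern concrete, here is a minimal sketch, with entirely invented model answers, of how something as small as the scoring rule can change a benchmark's verdict. The two "models," their answers, and the questions below are hypothetical and not drawn from any real evaluation; the point is only that a strict exact-match scorer and a lenient normalized scorer can rank the same outputs differently.

```python
# Hypothetical illustration: a small change in scoring rules can change
# which of two models "wins" a benchmark. All answers below are invented.

def exact_match(prediction: str, reference: str) -> bool:
    """Strict scoring: the prediction must match the reference verbatim."""
    return prediction == reference

def normalized_match(prediction: str, reference: str) -> bool:
    """Lenient scoring: ignore case, surrounding whitespace, and a trailing period."""
    norm = lambda s: s.strip().lower().rstrip(".")
    return norm(prediction) == norm(reference)

def accuracy(predictions, references, scorer) -> float:
    """Fraction of predictions the given scorer counts as correct."""
    hits = sum(scorer(p, r) for p, r in zip(predictions, references))
    return hits / len(references)

references = ["Paris", "4", "Mount Everest", "Jupiter"]
model_a = ["Paris", "4", "mount everest.", "jupiter"]   # correct, but sloppily formatted
model_b = ["Paris", "4", "K2", "Saturn"]                # cleanly formatted, but wrong

for name, scorer in [("exact", exact_match), ("normalized", normalized_match)]:
    a = accuracy(model_a, references, scorer)
    b = accuracy(model_b, references, scorer)
    print(f"{name}: model A = {a:.2f}, model B = {b:.2f}")
```

Under exact matching the two models tie, because model A's correct answers are penalized for formatting; under normalized matching, model A pulls clearly ahead. A benchmark publisher choosing between two defensible scorers can thus report two different stories from the same outputs.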
Standards are actively evolving but still leave much room for selectivity, or even manipulation in how benchmarks are utilized. Companies jockeying for supremacy have inherent conflicts of interest here. But academics also recognize benchmarking as an immense challenge still requiring considerable research. In the absence of consensus standards, healthy skepticism of marketing claims remains prudent.
An Ethical Imperative for Safety
Safety and ethics should be paramount as AI systems gain immense reach. A toxic strain of racism, misogyny and abuse is already evident in some online communities. If new models merely amplify such harmful biases, they could further marginalize vulnerable populations.
We know OpenAI invested heavily in alignment techniques and oversight mechanisms for GPT-4, but far less is publicly known about Gemini's safety precautions. While Google emphasizes responsible development, cooperation even between tech allies currently seems limited.
Industry initiatives around AI ethics are fortunately gaining momentum, including partnerships on technical standards and best practices. But competitive pressures could still push companies to prioritize speed over diligence regarding social impacts. And promising governance frameworks like the EU's AI Act still face lengthy adoption timelines. So, the onus likely falls for now on both creators and civil society groups to champion safety-conscious advancement over reckless profiteering.
Multimodal Potential Still Constrained Today
Both Gemini and GPT-4 hint at a future permeated with multimodal AI, seamlessly converging language, vision, and other modalities. Google suggests eventually using Gemini to enhance search engines, identifying objects in uploaded images to improve results. Other speculated applications include automated video generation, medical diagnosis aids, accessibility tools and more.
But current evidence indicates such capabilities remain largely aspirational.
In demos, Gemini failed at basic vision requests like counting simple shapes. GPT-4 managed better but still struggled with visual puzzles requiring elementary reasoning. And real-time computer vision for applications like live captioning is still lacking. So, while steady progress continues, today’s models have narrow vision abilities that require heavy guidance. Deploying them into assistive roles risks frustrating users with poor performance.
Managing expectations poses an ongoing obstacle. Companies naturally wish to promote their innovations, researchers want to highlight progress, and the media jumps readily on AI hype. But rational skepticism is warranted when demos seem to misrepresent actual capabilities. Moderating claims with transparency around limitations better serves the public and field alike.
Pondering the Societal Impacts
Stepping back, the rapid pace of language model advancement has profound importance extending far beyond business competition. Already such models can generate helpful articles, code, emails and more. But they also risk accelerating misinformation, media manipulation and polarization.
Systems like Gemini and GPT-4 will likely bring immense economic shifts as well, automating more knowledge work and disrupting job markets.
Some hail this technical progress for its convenience and coming prosperity. But others increasingly warn about unintended consequences regarding bias, privacy, automation, hacking, autonomous weapons and more.
Further out, advanced AI poses complex philosophical questions around consciousness, self-determination, inequality, human obsolescence, and our very place in this new world.
The trajectories all point toward AI permeating nearly all sectors and aspects of society. Reckless application risks damage, while prudent guidance improves the odds of inclusive benefit. But determining policies and priorities suitable to guide such unprecedented capability represents no small task. It demands our most thoughtful leaders and researchers be proactively involved, seeking wisdom and insight wherever it can be found.
Beyond the Hype, Vigilance and Balance
The unveiling of ambitious new models like Gemini continues to push the boundaries of what machines can achieve. Google's claims justifiably highlight the remarkable progress being made in natural language processing. But gaps between marketing hype and actual capabilities warrant fact-checking caution, not just for Gemini but across much of the field.
Leaders must also look past short-term commercial interests and grapple earnestly with the serious ethical questions arising. And researchers should continue driving progress but with care, foresight, and concern for the common good.
Striking the right balances poses profound challenges but also opportunities. With ethical standards and governance providing guidance, advanced AI can hopefully uplift society and empower individuals. But we must ensure through diligence and cooperation that its disruptive potential promotes prosperity broadly, not just for a few. If the entire technology community heeds this call, perhaps the coming machine age need not be feared, but instead embraced for its promise.
Sources:
https://www.theverge.com/2023/12/6/23990466/google-gemini-llm-ai-model
https://9to5google.com/2023/12/06/google-gemini-1-0/