SlideShare uma empresa Scribd logo
1 de 7
Baixar para ler offline
4/22/24, 10:47 AM
A.I. Has a Measurement Problem - The New York Times
Page 1 of 7
https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email
A.I. Has a Measurement Problem
Which A.I. system writes the best computer code or
generates the most realistic image? Right now,
there’s no easy way to answer those questions.
April 15, 2024
4/22/24, 10:47 AM
A.I. Has a Measurement Problem - The New York Times
Page 2 of 7
https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email
Davide Comai
4/22/24, 10:47 AM
A.I. Has a Measurement Problem - The New York Times
Page 3 of 7
https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email
There’s a problem with leading artificial intelligence tools like ChatGPT,
Gemini and Claude: We don’t really know how smart they are.
That’s because, unlike companies that make cars or drugs or baby
formula, A.I. companies aren’t required to submit their products for testing
before releasing them to the public. There’s no Good Housekeeping seal
for A.I. chatbots, and few independent groups are putting these tools
through their paces in a rigorous way.
Instead, we’re left to rely on the claims of A.I. companies, which often use
vague, fuzzy phrases like “improved capabilities” to describe how their
models differ from one version to the next. And while there are some
standard tests given to A.I. models to assess how good they are at, say,
math or logical reasoning, many experts have doubts about how reliable
those tests really are.
This might sound like a petty gripe. But I’ve become convinced that a lack
of good measurement and evaluation for A.I. systems is a major problem.
For starters, without reliable information about A.I. products, how are
people supposed to know what to do with them?
I can’t count the number of times I’ve been asked in the past year, by a
friend or a colleague, which A.I. tool they should use for a certain task.
Does ChatGPT or Gemini write better Python code? Is DALL-E 3 or
Midjourney better at generating realistic images of people?
I usually just shrug in response. Even as someone who writes about A.I. for
a living and tests new tools constantly, I’ve found it maddeningly hard to
keep track of the relative strengths and weaknesses of various A.I.
products. Most tech companies don’t publish user manuals or detailed
release notes for their A.I. products. And the models are updated so
frequently that a chatbot that struggles with a task one day might
mysteriously excel at it the next.
4/22/24, 10:47 AM
A.I. Has a Measurement Problem - The New York Times
Page 4 of 7
https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email
Shoddy measurement also creates a safety risk. Without better tests for
A.I. models, it’s hard to know which capabilities are improving faster than
expected, or which products might pose real threats of harm.
In this year’s A.I. Index — a big annual report put out by Stanford
University’s Institute for Human-Centered Artificial Intelligence — the
authors describe poor measurement as one of the biggest challenges
facing A.I. researchers.
“The lack of standardized evaluation makes it extremely challenging to
systematically compare the limitations and risks of various A.I. models,”
the report’s editor in chief, Nestor Maslej, told me.
For years, the most popular method for measuring artificial intelligence
was the so-called Turing Test — an exercise proposed in 1950 by the
mathematician Alan Turing, which tests whether a computer program can
fool a person into mistaking its responses for a human’s.
But today’s A.I. systems can pass the Turing Test with flying colors, and
researchers have had to come up with new, harder evaluations.
One of the most common tests given to A.I. models today — the SAT for
chatbots, essentially — is a test known as Massive Multitask Language
Understanding, or MMLU.
The MMLU, which was released in 2020, consists of a collection of
roughly 16,000 multiple-choice questions covering dozens of academic
subjects, ranging from abstract algebra to law and medicine. It’s
supposed to be a kind of general intelligence test — the more of these
questions a chatbot answers correctly, the smarter it is.
It has become the gold standard for A.I. companies competing for
dominance. (When Google released its most advanced A.I. model, Gemini
Ultra, earlier this year, it boasted that it had scored 90 percent on the
4/22/24, 10:47 AM
A.I. Has a Measurement Problem - The New York Times
Page 5 of 7
https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email
MMLU — the highest score ever recorded.)
Dan Hendrycks, an A.I. safety researcher who helped develop the MMLU
while in graduate school at the University of California, Berkeley, told me
that the test was never supposed to be used for bragging rights. He was
alarmed by how quickly A.I. systems were improving, and wanted to
encourage researchers to take it more seriously.
Mr. Hendrycks said that while he thought MMLU “probably has another
year or two of shelf life,” it will soon need to be replaced by different,
harder tests. A.I. systems are getting too smart for the tests we have now,
and it’s getting more difficult to design new ones.
“All of these benchmarks are wrong, but some are useful,” he said. “Some
of them can serve some utility for a fixed amount of time, but at some
point, there’s so much pressure put on it that it reaches its breaking point.”
There are dozens of other tests out there — with names like TruthfulQA
and HellaSwag — that are meant to capture other facets of A.I.
performance. But just as the SAT captures only part of a student’s intellect
and ability, these tests are capable of measuring only a narrow slice of an
A.I. system’s power.
And none of them are designed to answer the more subjective questions
many users have, such as: Is this chatbot fun to talk to? Is it better for
automating routine office work, or creative brainstorming? How strict are
its safety guardrails?
(The New York Times has sued OpenAI, the maker of ChatGPT, and its
partner, Microsoft, on claims of copyright infringement involving artificial
intelligence systems that generate text.)
There may also be problems with the tests themselves. Several
researchers I spoke to warned that the process for administering
4/22/24, 10:47 AM
A.I. Has a Measurement Problem - The New York Times
Page 6 of 7
https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email
benchmark tests like MMLU varies slightly from company to company, and
that various models’ scores might not be directly comparable.
There is a problem known as “data contamination,” when the questions
and answers for benchmark tests are included in an A.I. model’s training
data, essentially allowing it to cheat. And there is no independent testing
or auditing process for these models, meaning that A.I. companies are
essentially grading their own homework.
In short, A.I. measurement is a mess — a tangle of sloppy tests, apples-to-
oranges comparisons and self-serving hype that has left users, regulators
and A.I. developers themselves grasping in the dark.
“Despite the appearance of science, most developers really judge models
based on vibes or instinct,” said Nathan Benaich, an A.I. investor with Air
Street Capital. “That might be fine for the moment, but as these models
grow in power and social relevance, it won’t suffice.”
The solution here is likely a combination of public and private efforts.
Governments can, and should, come up with robust testing programs that
measure both the raw capabilities and the safety risks of A.I. models, and
they should fund grants and research projects aimed at coming up with
new, high-quality evaluations. (In its executive order on A.I. last year, the
White House directed several federal agencies, including the National
Institute of Standards and Technology, to create and oversee new ways of
evaluating A.I. systems.)
Some progress is also emerging out of academia. Last year, Stanford
researchers introduced a new test for A.I. image models that uses human
evaluators, rather than automated tests, to determine how capable a
model is. And a group of researchers from the University of California,
Berkeley, recently started Chatbot Arena, a popular leaderboard that pits
anonymous, randomized A.I. models against one another and asks users
4/22/24, 10:47 AM
A.I. Has a Measurement Problem - The New York Times
Page 7 of 7
https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email
to vote on the best model.
A.I. companies can also help by committing to work with third-party
evaluators and auditors to test their models, by making new models more
widely available to researchers and by being more transparent when their
models are updated. And in the media, I hope some kind of Wirecutter-
style publication will eventually emerge to take on the task of reviewing
new A.I. products in a rigorous and trustworthy way.
Researchers at Anthropic, the A.I. company, wrote in a blog post last year
that “effective A.I. governance depends on our ability to meaningfully
evaluate A.I. systems.”
I agree. Artificial intelligence is too important a technology to be evaluated
on the basis of vibes. Until we get better ways of measuring these tools,
we won’t know how to use them, or whether their progress should be
celebrated or feared.

Mais conteúdo relacionado

Semelhante a A.I. Has a Measurement Problem: Can It Be Solved?

The Machine Learning Audit. MIS ITAC 2017 Keynote
The Machine Learning Audit. MIS ITAC 2017 KeynoteThe Machine Learning Audit. MIS ITAC 2017 Keynote
The Machine Learning Audit. MIS ITAC 2017 KeynoteAndrew Clark
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Matthew Lease
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...IRJET Journal
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...IRJET Journal
 
Do clickers beat pcs for testing students
Do clickers beat pcs for testing studentsDo clickers beat pcs for testing students
Do clickers beat pcs for testing studentsDr. Tina Rooks
 
Towards Mining Software Repositories Research that Matters
Towards Mining Software Repositories Research that MattersTowards Mining Software Repositories Research that Matters
Towards Mining Software Repositories Research that MattersTao Xie
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAIRCC Publishing Corporation
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAIRCC Publishing Corporation
 
How to create a taxonomy for management buy-in
How to create a taxonomy for management buy-inHow to create a taxonomy for management buy-in
How to create a taxonomy for management buy-inMary Chitty
 
M2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparencyM2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparencyBoPeng76
 
Detection of Fake News Using Machine Learning
Detection of Fake News Using Machine LearningDetection of Fake News Using Machine Learning
Detection of Fake News Using Machine LearningIRJET Journal
 
Machine Learning for Finance Master Class
Machine Learning for Finance Master Class Machine Learning for Finance Master Class
Machine Learning for Finance Master Class QuantUniversity
 
Diy research trends webinar(2) revised(2)
Diy research trends webinar(2) revised(2)Diy research trends webinar(2) revised(2)
Diy research trends webinar(2) revised(2)QuestionPro
 
Altimeter Social ROI Report
Altimeter Social ROI ReportAltimeter Social ROI Report
Altimeter Social ROI ReportGoodbuzz Inc.
 
The Social Media ROI Cookbook
The Social Media ROI CookbookThe Social Media ROI Cookbook
The Social Media ROI CookbookSmomedia.ru
 

Semelhante a A.I. Has a Measurement Problem: Can It Be Solved? (20)

The Machine Learning Audit. MIS ITAC 2017 Keynote
The Machine Learning Audit. MIS ITAC 2017 KeynoteThe Machine Learning Audit. MIS ITAC 2017 Keynote
The Machine Learning Audit. MIS ITAC 2017 Keynote
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
 
A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...A study of cyberbullying detection using Deep Learning and Machine Learning T...
A study of cyberbullying detection using Deep Learning and Machine Learning T...
 
Do clickers beat pcs for testing students
Do clickers beat pcs for testing studentsDo clickers beat pcs for testing students
Do clickers beat pcs for testing students
 
Ai Now institute 2017 report
 Ai Now institute 2017 report Ai Now institute 2017 report
Ai Now institute 2017 report
 
Faisal Seminar.pptx
Faisal Seminar.pptxFaisal Seminar.pptx
Faisal Seminar.pptx
 
Towards Mining Software Repositories Research that Matters
Towards Mining Software Repositories Research that MattersTowards Mining Software Repositories Research that Matters
Towards Mining Software Repositories Research that Matters
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
 
F1033541
F1033541F1033541
F1033541
 
Techtalk March 2018
Techtalk March 2018Techtalk March 2018
Techtalk March 2018
 
How to create a taxonomy for management buy-in
How to create a taxonomy for management buy-inHow to create a taxonomy for management buy-in
How to create a taxonomy for management buy-in
 
M2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparencyM2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparency
 
Detection of Fake News Using Machine Learning
Detection of Fake News Using Machine LearningDetection of Fake News Using Machine Learning
Detection of Fake News Using Machine Learning
 
Machine Learning for Finance Master Class
Machine Learning for Finance Master Class Machine Learning for Finance Master Class
Machine Learning for Finance Master Class
 
Diy research trends webinar(2) revised(2)
Diy research trends webinar(2) revised(2)Diy research trends webinar(2) revised(2)
Diy research trends webinar(2) revised(2)
 
Altimeter Social ROI Report
Altimeter Social ROI ReportAltimeter Social ROI Report
Altimeter Social ROI Report
 
Altimeter social roi report
Altimeter social roi reportAltimeter social roi report
Altimeter social roi report
 
The Social Media ROI Cookbook
The Social Media ROI CookbookThe Social Media ROI Cookbook
The Social Media ROI Cookbook
 

Mais de LUMINATIVE MEDIA/PROJECT COUNSEL MEDIA GROUP

Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to KnowWho’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to KnowLUMINATIVE MEDIA/PROJECT COUNSEL MEDIA GROUP
 

Mais de LUMINATIVE MEDIA/PROJECT COUNSEL MEDIA GROUP (20)

Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
Why democracy dies in Trumpian boredom (by Edward Luce)
Why democracy dies in Trumpian boredom (by Edward Luce)Why democracy dies in Trumpian boredom (by Edward Luce)
Why democracy dies in Trumpian boredom (by Edward Luce)
 
An interview with the director of "Zone of Interest"
An interview with the director of "Zone of Interest"An interview with the director of "Zone of Interest"
An interview with the director of "Zone of Interest"
 
"Schindler’s List" : an oral history with the actors
"Schindler’s List" : an oral history with the actors"Schindler’s List" : an oral history with the actors
"Schindler’s List" : an oral history with the actors
 
Chinese Startup 01.AI Is Winning the Open Source AI Race
Chinese Startup 01.AI Is Winning the Open Source AI RaceChinese Startup 01.AI Is Winning the Open Source AI Race
Chinese Startup 01.AI Is Winning the Open Source AI Race
 
The mismeasuring of AI: How it all began
The mismeasuring of AI: How it all beganThe mismeasuring of AI: How it all began
The mismeasuring of AI: How it all began
 
Google’s Gemini Marketing Trick: what a trickster!
Google’s Gemini Marketing Trick: what a trickster!Google’s Gemini Marketing Trick: what a trickster!
Google’s Gemini Marketing Trick: what a trickster!
 
Inside the Magical World of AI Prompters on Reddit
Inside the Magical World of AI Prompters on RedditInside the Magical World of AI Prompters on Reddit
Inside the Magical World of AI Prompters on Reddit
 
Regulators blame Bezos for making Amazon worse in new lawsuit details
Regulators blame Bezos for making Amazon worse in new lawsuit detailsRegulators blame Bezos for making Amazon worse in new lawsuit details
Regulators blame Bezos for making Amazon worse in new lawsuit details
 
Bariatric Surgery at 16
Bariatric Surgery at 16Bariatric Surgery at 16
Bariatric Surgery at 16
 
Palestinians Claim Social Media 'Censorship' Is Endangering Lives
Palestinians Claim Social Media 'Censorship' Is Endangering LivesPalestinians Claim Social Media 'Censorship' Is Endangering Lives
Palestinians Claim Social Media 'Censorship' Is Endangering Lives
 
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to KnowWho’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
 
Why ChatGPT Is Getting Dumber at Basic Math
Why ChatGPT Is Getting Dumber at Basic MathWhy ChatGPT Is Getting Dumber at Basic Math
Why ChatGPT Is Getting Dumber at Basic Math
 
U.S. and E.U. Finalize Long-Awaited Deal on Sharing Data
U.S. and E.U. Finalize Long-Awaited Deal on Sharing DataU.S. and E.U. Finalize Long-Awaited Deal on Sharing Data
U.S. and E.U. Finalize Long-Awaited Deal on Sharing Data
 
Will A.I. Become the New McKinsey?
Will A.I. Become the New McKinsey?Will A.I. Become the New McKinsey?
Will A.I. Become the New McKinsey?
 
AI is already writing books, websites and online recipes
AI is already writing books, websites and online recipesAI is already writing books, websites and online recipes
AI is already writing books, websites and online recipes
 
What happens when ChatGPT lies about real people?
What happens when ChatGPT lies about real people?What happens when ChatGPT lies about real people?
What happens when ChatGPT lies about real people?
 
The Brilliant Inventor Who Made Two of History’s Biggest Mistakes
The Brilliant Inventor Who Made Two of History’s Biggest MistakesThe Brilliant Inventor Who Made Two of History’s Biggest Mistakes
The Brilliant Inventor Who Made Two of History’s Biggest Mistakes
 
Wirecard fraudster Jan Marsalek’s grandfather was suspected Russian spy
Wirecard fraudster Jan Marsalek’s grandfather was suspected Russian spyWirecard fraudster Jan Marsalek’s grandfather was suspected Russian spy
Wirecard fraudster Jan Marsalek’s grandfather was suspected Russian spy
 

Último

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Último (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

A.I. Has a Measurement Problem: Can It Be Solved?

  • 1. 4/22/24, 10:47 AM A.I. Has a Measurement Problem - The New York Times Page 1 of 7 https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email A.I. Has a Measurement Problem Which A.I. system writes the best computer code or generates the most realistic image? Right now, there’s no easy way to answer those questions. April 15, 2024
  • 2. 4/22/24, 10:47 AM A.I. Has a Measurement Problem - The New York Times Page 2 of 7 https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email Davide Comai
  • 3. 4/22/24, 10:47 AM A.I. Has a Measurement Problem - The New York Times Page 3 of 7 https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email There’s a problem with leading artificial intelligence tools like ChatGPT, Gemini and Claude: We don’t really know how smart they are. That’s because, unlike companies that make cars or drugs or baby formula, A.I. companies aren’t required to submit their products for testing before releasing them to the public. There’s no Good Housekeeping seal for A.I. chatbots, and few independent groups are putting these tools through their paces in a rigorous way. Instead, we’re left to rely on the claims of A.I. companies, which often use vague, fuzzy phrases like “improved capabilities” to describe how their models differ from one version to the next. And while there are some standard tests given to A.I. models to assess how good they are at, say, math or logical reasoning, many experts have doubts about how reliable those tests really are. This might sound like a petty gripe. But I’ve become convinced that a lack of good measurement and evaluation for A.I. systems is a major problem. For starters, without reliable information about A.I. products, how are people supposed to know what to do with them? I can’t count the number of times I’ve been asked in the past year, by a friend or a colleague, which A.I. tool they should use for a certain task. Does ChatGPT or Gemini write better Python code? Is DALL-E 3 or Midjourney better at generating realistic images of people? I usually just shrug in response. Even as someone who writes about A.I. for a living and tests new tools constantly, I’ve found it maddeningly hard to keep track of the relative strengths and weaknesses of various A.I. products. Most tech companies don’t publish user manuals or detailed release notes for their A.I. products. And the models are updated so frequently that a chatbot that struggles with a task one day might mysteriously excel at it the next.
  • 4. 4/22/24, 10:47 AM A.I. Has a Measurement Problem - The New York Times Page 4 of 7 https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email Shoddy measurement also creates a safety risk. Without better tests for A.I. models, it’s hard to know which capabilities are improving faster than expected, or which products might pose real threats of harm. In this year’s A.I. Index — a big annual report put out by Stanford University’s Institute for Human-Centered Artificial Intelligence — the authors describe poor measurement as one of the biggest challenges facing A.I. researchers. “The lack of standardized evaluation makes it extremely challenging to systematically compare the limitations and risks of various A.I. models,” the report’s editor in chief, Nestor Maslej, told me. For years, the most popular method for measuring artificial intelligence was the so-called Turing Test — an exercise proposed in 1950 by the mathematician Alan Turing, which tests whether a computer program can fool a person into mistaking its responses for a human’s. But today’s A.I. systems can pass the Turing Test with flying colors, and researchers have had to come up with new, harder evaluations. One of the most common tests given to A.I. models today — the SAT for chatbots, essentially — is a test known as Massive Multitask Language Understanding, or MMLU. The MMLU, which was released in 2020, consists of a collection of roughly 16,000 multiple-choice questions covering dozens of academic subjects, ranging from abstract algebra to law and medicine. It’s supposed to be a kind of general intelligence test — the more of these questions a chatbot answers correctly, the smarter it is. It has become the gold standard for A.I. companies competing for dominance. (When Google released its most advanced A.I. model, Gemini Ultra, earlier this year, it boasted that it had scored 90 percent on the
  • 5. 4/22/24, 10:47 AM A.I. Has a Measurement Problem - The New York Times Page 5 of 7 https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email MMLU — the highest score ever recorded.) Dan Hendrycks, an A.I. safety researcher who helped develop the MMLU while in graduate school at the University of California, Berkeley, told me that the test was never supposed to be used for bragging rights. He was alarmed by how quickly A.I. systems were improving, and wanted to encourage researchers to take it more seriously. Mr. Hendrycks said that while he thought MMLU “probably has another year or two of shelf life,” it will soon need to be replaced by different, harder tests. A.I. systems are getting too smart for the tests we have now, and it’s getting more difficult to design new ones. “All of these benchmarks are wrong, but some are useful,” he said. “Some of them can serve some utility for a fixed amount of time, but at some point, there’s so much pressure put on it that it reaches its breaking point.” There are dozens of other tests out there — with names like TruthfulQA and HellaSwag — that are meant to capture other facets of A.I. performance. But just as the SAT captures only part of a student’s intellect and ability, these tests are capable of measuring only a narrow slice of an A.I. system’s power. And none of them are designed to answer the more subjective questions many users have, such as: Is this chatbot fun to talk to? Is it better for automating routine office work, or creative brainstorming? How strict are its safety guardrails? (The New York Times has sued OpenAI, the maker of ChatGPT, and its partner, Microsoft, on claims of copyright infringement involving artificial intelligence systems that generate text.) There may also be problems with the tests themselves. Several researchers I spoke to warned that the process for administering
  • 6. 4/22/24, 10:47 AM A.I. Has a Measurement Problem - The New York Times Page 6 of 7 https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email benchmark tests like MMLU varies slightly from company to company, and that various models’ scores might not be directly comparable. There is a problem known as “data contamination,” when the questions and answers for benchmark tests are included in an A.I. model’s training data, essentially allowing it to cheat. And there is no independent testing or auditing process for these models, meaning that A.I. companies are essentially grading their own homework. In short, A.I. measurement is a mess — a tangle of sloppy tests, apples-to- oranges comparisons and self-serving hype that has left users, regulators and A.I. developers themselves grasping in the dark. “Despite the appearance of science, most developers really judge models based on vibes or instinct,” said Nathan Benaich, an A.I. investor with Air Street Capital. “That might be fine for the moment, but as these models grow in power and social relevance, it won’t suffice.” The solution here is likely a combination of public and private efforts. Governments can, and should, come up with robust testing programs that measure both the raw capabilities and the safety risks of A.I. models, and they should fund grants and research projects aimed at coming up with new, high-quality evaluations. (In its executive order on A.I. last year, the White House directed several federal agencies, including the National Institute of Standards and Technology, to create and oversee new ways of evaluating A.I. systems.) Some progress is also emerging out of academia. Last year, Stanford researchers introduced a new test for A.I. image models that uses human evaluators, rather than automated tests, to determine how capable a model is. And a group of researchers from the University of California, Berkeley, recently started Chatbot Arena, a popular leaderboard that pits anonymous, randomized A.I. models against one another and asks users
  • 7. 4/22/24, 10:47 AM A.I. Has a Measurement Problem - The New York Times Page 7 of 7 https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html?utm_source=substack&utm_medium=email to vote on the best model. A.I. companies can also help by committing to work with third-party evaluators and auditors to test their models, by making new models more widely available to researchers and by being more transparent when their models are updated. And in the media, I hope some kind of Wirecutter- style publication will eventually emerge to take on the task of reviewing new A.I. products in a rigorous and trustworthy way. Researchers at Anthropic, the A.I. company, wrote in a blog post last year that “effective A.I. governance depends on our ability to meaningfully evaluate A.I. systems.” I agree. Artificial intelligence is too important a technology to be evaluated on the basis of vibes. Until we get better ways of measuring these tools, we won’t know how to use them, or whether their progress should be celebrated or feared.