Human quality raters have been the mainstay of search engine evaluation for decades, but a sea-change is on its way: machine learning keeps evolving, and the demand for labelled data requires a scale that humans alone cannot supply.
1. Humans vs LLMs As Quality Raters for Search Engines
Are major changes coming?
Dawn Anderson - March 2024
2. Dawn Anderson
• UK-based SEO consultant
• 17 years in SEO
• Occasional SEO conference speaker
• EU, UK, US, Global Search Awards judge
• Previous digital marketing lecturer & trainer
• Industry publication contributor
• Now predominantly consulting all of the time
Stalker of information retrieval threads and IR conference hashtags since 2017
3. A sea-change is coming for a fundamental part of search
On the other side of the "front door"
4. The important algorithmic ranking evaluation stage
• Crawling: discovery & refresh
• Indexing: if importance thresholds reached
• Ranking (& re-ranking): in response to a query
• Serving: dynamic build at runtime
5. The process of search results evaluation (ranking system)
Determine how well a "system" (ranking system) fares, either currently (continuous evaluation) or when compared against proposed changes
16. Implicit evaluation ("human" in the loop has no awareness)
• Tests on real searcher segments
• Anonymous scroll and click behaviour
• UX testing on any site (heatmaps / recordings all fall into this category)
17. Explicit evaluation (human knows they are actively evaluating)
• E.g. searchers asked to provide feedback
• Netflix users asked to thumbs-up a film
• Spotify favouriting or playlist building - leads to further recommendations
• User groups / user panels
• Sites asking for feedback
• Professional expert relevance annotators
• Paid human contractor evaluators
18. But it all mostly comes down to labels & labelling anyway
IMPORTANT… Labels are training data for machine learning
19. Labels are all around us
In vast numbers they are converted into mathematical form for machine learning training data
20. We are ALL data labellers… every single day
21. A cohort of similar data labellers helps with recommender systems
Birds of a feather flock together… they like the same things
22. Data labels teach machines to know the difference between cats and dogs (supervised learning)
Cat, dog, dog, cat, cat, dog, cat, dog, dog, dog, cat
23. Search engines have used "The Crowd" for HITL (human-in-the-loop) evaluation for more than two decades
24. In search… "The Crowd" "labels" sample comparative search result sets
"Relevant" or "not relevant"
25. Pair-wise side-by-side comparisons of SERP results make up the majority of relevance evaluation exercises
PAIR-WISE COMPARISON (two result sets shown side by side)
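As a rough illustration of how those side-by-side judgements roll up, here is a minimal sketch that aggregates pairwise "A better / B better" votes into a preference rate; the votes and the tie handling are made-up assumptions, not real rater output.

    from collections import Counter

    # Illustrative rater votes from side-by-side SERP comparisons.
    votes = ["A", "B", "A", "A", "tie", "B", "A"]

    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    # Share of decided votes preferring result set A over result set B.
    print(f"A preferred in {counts['A'] / decided:.0%} of decided comparisons")  # 67%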
26. Instead of "yum" or "barf" labels, it's "relevant" or "not-relevant" labels
27. But it's mostly aggregated binary data
Binary labels rolled up into overall relevance scores
28. Effectively a measurement of NDCG (Normalised Discounted Cumulative Gain) and / or DCG (Discounted Cumulative Gain)
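To make those metrics concrete, here is a minimal sketch of DCG and NDCG over binary relevance labels for one ranked result list; the example labels and the log2 rank discount are illustrative assumptions, not any search engine's actual implementation.

    import math

    def dcg(labels):
        # Discounted Cumulative Gain: each result's relevance label is
        # discounted by the log2 of its 1-based rank position.
        return sum(rel / math.log2(rank + 1)
                   for rank, rel in enumerate(labels, start=1))

    def ndcg(labels):
        # NDCG normalises DCG by the "ideal" DCG of the same labels
        # sorted best-first, giving a score between 0 and 1.
        ideal = dcg(sorted(labels, reverse=True))
        return dcg(labels) / ideal if ideal > 0 else 0.0

    # Illustrative binary labels for the top five results (1 = relevant).
    labels = [1, 0, 1, 1, 0]
    print(round(dcg(labels), 3))   # 1.931
    print(round(ndcg(labels), 3))  # 0.906

Because the discount shrinks with rank, placing relevant results higher raises the score, which is exactly what the aggregated binary labels are used to measure.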
42. But to the detriment of quality
"Such annotation tasks were delegated to crowd workers, with a substantial decrease in terms of quality of the annotation, compensated by a huge increase in annotated data." (Clarke et al., 2022)
45. Data labelling industry crisis… demand outstrips supply
• There is a bottleneck (and it's going to get worse)
• Not enough labels produced to deal with the size of machine learning models
46. "The global data collection and labeling market size was valued at $2.22 billion in 2022 and it is expected to expand at a compound annual growth rate of 28.9% from 2023 to 2030, with the market then expected to be worth $13.7 billion."
Source: Grand View Research, 2021
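(A quick arithmetic check on that projection: $2.22 billion compounding at 28.9% over the seven years from 2023 to 2030 gives $2.22B × 1.289^7 ≈ $13.1B, broadly in line with the quoted $13.7 billion.)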
47. Data labellers work across many industries, many companies
• Maps
• Assistant
• AI content detection
• Search quality evaluation
• Image detection labelling
• AI content detection training
• Any other ML-driven application
48. High risk of under-trained ML models due to scaling without label volume increase
49. DeepMind researchers - "We find current large language models are significantly undertrained, a consequence of the recent focus on scaling language models while keeping the amount of training data constant. …we find for compute-optimal training …for every doubling of model size the number of training tokens should also be doubled."
• "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022)
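A minimal sketch of that scaling rule; the roughly 20-training-tokens-per-parameter ratio is the approximate compute-optimal figure from Hoffmann et al. (2022), and the model sizes below are illustrative assumptions.

    def chinchilla_tokens(n_params, tokens_per_param=20):
        # Hoffmann et al. (2022): compute-optimal training uses roughly
        # ~20 tokens per parameter, so doubling model size means
        # doubling the training tokens (and the labels behind them) too.
        return n_params * tokens_per_param

    for params in (7e9, 14e9, 28e9):  # illustrative model sizes
        print(f"{params / 1e9:.0f}B params -> {chinchilla_tokens(params) / 1e12:.2f}T tokens")

Output: 7B parameters need about 0.14T tokens, 14B about 0.28T, 28B about 0.56T - which is why label production that does not scale with model size leaves models undertrained.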
56. "The crowd is made of people - Observations from large scale crowd labelling" (Thomas et al., 2022)
57. Bing researchers - "The Crowd is Made of People: Observations from large scale crowd-labelling" (Thomas et al., 2022)
Findings:
• Fatigue
• Time of day & day of week
• Anchoring
• Task-switching
• Left-side bias
• General disagreement on relevance
60. "Large language models can accurately predict searcher preferences" (Thomas et al., 2023)
Bing's LLM & GPT-4 research
61. • GPT-4 prompt engineering (role-playing prompt)
• (Up to 5) LLM agents to emulate the behaviour of search relevance evaluators
• Produce enough gold and silver labels to build relevance training data for much larger data sets
• Train the agents initially on gold labels
"Large language models can accurately predict searcher preferences" (Thomas et al., 2023)
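For flavour, here is a minimal sketch of that setup: several LLM "agents" judge the same query-result pair and their votes are aggregated. The prompt wording, the call_llm helper, and the majority-vote aggregation are assumptions for illustration, not the paper's exact method.

    from collections import Counter

    PROMPT = """You are a search quality rater evaluating search results.
    Given a query and a result, answer with exactly one word:
    "relevant" or "not-relevant".

    Query: {query}
    Result: {result}
    Label:"""

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a GPT-4 API call returning one label.
        raise NotImplementedError("wire up an LLM client here")

    def label_with_agents(query: str, result: str, n_agents: int = 5) -> str:
        # Ask up to five "agents" (independently sampled completions)
        # and take the majority vote as the silver relevance label.
        votes = [call_llm(PROMPT.format(query=query, result=result))
                 for _ in range(n_agents)]
        return Counter(votes).most_common(1)[0][0]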
63. "To measure agreement with real searchers needs high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers." (Thomas et al., 2023)
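One standard way to quantify that agreement against gold labels is Cohen's kappa; this is a minimal sketch with made-up labels using scikit-learn's implementation - the metric choice here is an assumption, the paper reports its own agreement measures.

    from sklearn.metrics import cohen_kappa_score

    # Illustrative gold labels from trusted raters vs. LLM-produced labels.
    gold = [1, 0, 1, 1, 0, 1, 0, 0]
    llm  = [1, 0, 1, 0, 0, 1, 0, 1]

    # Cohen's kappa: chance-corrected agreement (1.0 = perfect).
    print(cohen_kappa_score(gold, llm))  # 0.5 on this toy data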
64. Bing's LLM evaluators - "A fraction of the cost and better rankers" (Thomas et al., 2023)
66. A spectrum of LLM & human rater collaborative approaches?
"Frontiers of Information Access Experimentation for Research and Education" (Clarke et al., 2022)
68. "It is yet to be understood what the risks associated with such technology are: it is likely that in the next few years, we will assist in a substantial increase in the usage of LLMs to replace human annotators." (Clarke et al., 2022)
69. But… concerns about reduced quality in exchange for scale
"It is a concern that machine-annotated assessments might degrade the quality, while dramatically increasing the number of annotations available." (Clarke et al., 2022)
73. Some surmised a switch to AI evaluators was part of the reason
74. Algorithms - bigger, broader, multi-modal / multi-aspected
Aspect-focused algorithms quickly get folded into core updates or run simultaneously:
• Product reviews
• Helpful content classifier
• Panda historically
• Spam updates
75. Machine learning classifiers
Google is learning quickly:
• what "unhelpful content" looks like
• what AI-generated content looks like
• what paid links look like