Deep Dive into Learning to Rank for Graph Search
1. Learning To Rank For Graph Search
Junfeng He, Search Quality and Ranking
Joint work with Cristina, Rajat, Hieu, Allan, Maxime, Jiayan,
Ethan, Scott, Kittpat, Alessandro, etc.
10/21/2013
2. Motivation of This Talk: Halloween!
The scariest Halloween gift:
You have 10 seconds to leave this room. If you choose to stay,
you are fully responsible for all possible consequences.
3. Outline
Ranking model for graph search
Learn the ranking model for graph search
Experiments and Discussions
4. Browse Queries
A photo vertical example
Photos of Rajat, Photos by my friends
Photos liked by Cristina, Photos commented on by me
Photos in Hawaii, photos before 2010, recent photos, …
Photos of my friends in Hawaii this year
My recommended photos of Girish Kumar's co-workers taken in
United States this year liked by Tom Stocky that are commented
on by friends of Lars Eilstrup Rasmussen
5. The Concept of Ranking Model -- typeahead
f1: FEATURE_SHARED_MUTUAL_FRIENDS
f2: FEATURE_DISTANCE_GRAPH
A toy user scoring model with two features:
f1: 30, f2: 1 → score: 0.86
f1: 8, f2: 2 → score: 0.4
7. Ranking Model
Intuitively, for searcher a's request b, what should be the score
for result/document c?
• score: a real number.
• Results with larger scores should be ranked higher
Mathematically, score = Ranking_model(feature_vector)
feature_vector: features that contain info about searcher,
query, result.
Ranking_model: which maps a set of features (i.e., a feature
vector) to a score.
8. Ranking Features
Total number of features for all verticals today: ~1200
Each vertical contains a subset of features
Photo browse: about 100 features. Examples:
• FEATURE_BROWSE_PHOTO_IN: whether the query is "photos in some place"
• FEATURE_PHOTO_HAS_FACE: whether this photo has a face
• FEATURE_PHOTO_NUM_FRIENDS_LIKED: how many of the searcher's friends liked
this photo, …
• In sum, features contain info about the searcher, query, and result (document)
9. Bucket Ranker
For continuous features: piecewise linear (one bucket is the region between two borders)
For discrete features: step-wise (every border is one bucket)
A bucket ranker can approximate any nonlinear function if we create enough buckets.
10. Bucket Ranker
One example:
{ "features" : {"FEATURE_SHARED_MUTUAL_FRIENDS" : [0,1,11,240]},
  "weights" : [ 0.0206305, 0.0555313, 0.284588, 0]}
Suppose x is the value of one data point on feature FEATURE_SHARED_MUTUAL_FRIENDS:
If x <= 0, score = 0.0206305
If 0 < x <= 1, score = 0.0206305 + 0.0555313*(x-0)/(1-0)
If 1 <= x <= 11, score = 0.0206305 + 0.0555313 + 0.284588*(x-1)/(11-1)
If 11 <= x <= 240, score = 0.0206305 + 0.0555313 + 0.284588 + 0*(x-11)/(240-11)
If x > 240, score = 0.0206305 + 0.0555313 + 0.284588 + 0
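The piecewise-linear lookup above can be sketched in a few lines. This is an illustrative reimplementation, not the production code: the `bucket_score` name and the list-based model representation are assumptions, but the borders and weights are the slide's own example.

```python
def bucket_score(x, borders, weights):
    """Score one feature value with a piecewise-linear bucket ranker.

    weights[0] is the base score for x <= borders[0]; weights[k] (k >= 1)
    is the score gained across the bucket (borders[k-1], borders[k]],
    added fully once x passes the bucket's right border and linearly
    interpolated inside the bucket.
    """
    score = weights[0]
    if x <= borders[0]:
        return score
    for k in range(1, len(borders)):
        lo, hi = borders[k - 1], borders[k]
        if x >= hi:
            score += weights[k]  # bucket fully crossed
        else:
            # partial bucket: linear interpolation between the borders
            return score + weights[k] * (x - lo) / (hi - lo)
    return score  # x beyond the last border: no further weight

# The slide's example model:
borders = [0, 1, 11, 240]
weights = [0.0206305, 0.0555313, 0.284588, 0]
print(bucket_score(30, borders, weights))  # 0.3607498 (the ~0.36 for f1: 30)
```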
11. Bucket Ranker
The output of the scoring model is the sum of the bucket scores on each feature:
s = R(f) = Σ_{j ∈ Features} BR_j(f_j)
Example:
f1: 30 → s1: 0.36, f2: 1 → s2: 0.5, score: 0.86
f1: 8 → s1: 0.3, f2: 2 → s2: 0.1, score: 0.4
12. Conditioned Bucket Ranker, i.e., Ranking Tree
s = R(f) = Σ_{i ∈ Conditions} Σ_{j ∈ Features} BR_ij(f_j)
Each condition (e.g., Condition 1: "photo of user", Condition 2: "photo in …") has its own bucket rankers over the features.
For different query intents, features matter differently: the face feature is a very important positive feature for "photo of user" queries, but is not important for "photos in some place" (where, e.g., an outdoor feature matters more).
13. Conditioned Bucket Ranker
s = R(f) = Σ_{i ∈ Conditions} Σ_{j ∈ Features} BR_ij(f_j)
Example:
Condition 1: "photo of user" (e.g., photos of my friends)
Condition 2: "photo in some place" (e.g., photos taken in Hawaii)
For one query "photos of my friends taken in Hawaii", the result will get scores from buckets under both condition 1 and condition 2.
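A query that satisfies several conditions simply accumulates bucket scores from every satisfied condition. Here is a minimal sketch, assuming a dict-of-dicts model layout with made-up condition names, feature names, and weights (the real model differs):

```python
def conditioned_score(features, model, satisfied):
    """s = R(f) = sum over satisfied conditions i, features j of BR_ij(f_j)."""
    return sum(
        br(features[feat])
        for cond in satisfied
        for feat, br in model[cond].items()
    )

# Each condition holds its own per-feature bucket rankers, represented
# here as plain callables; the numbers are purely illustrative.
model = {
    "photo_of_user": {
        "face": lambda v: 0.5 if v else 0.0,     # face feature matters here
        "outdoor": lambda v: 0.0,                # ignored for this intent
    },
    "photo_in_place": {
        "face": lambda v: 0.0,
        "outdoor": lambda v: 0.3 if v else 0.0,  # outdoor matters here
    },
}

features = {"face": 1, "outdoor": 1}
# A query like "photos of my friends taken in Hawaii" satisfies both conditions:
total = conditioned_score(features, model, ["photo_of_user", "photo_in_place"])
print(total)  # 0.8
```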
14. A Defense for the Bucket Ranker Model
Question: Why Bucket Ranker, why not Linear, Random Forests,
Boosted Decision Trees, LambdaMART, Bayesian Graph Model (yet)?
Answer:
A white-box model: good interpretability and debuggability
We still have lots of problems with features, labeling, data logging, etc., so our
data may not be ready to train a black-box model
We often need to manually modify the model (e.g., to support new queries, hot-fix
an important bug, or cover corner cases when training data is not good enough)
Complex enough to guarantee good ranking quality
15. A Defense for the Bucket Ranker Model
Bucket Ranker gives us a good tradeoff between interpretability/debuggability
and ranking quality:
Interpretability: linear > Bucket Ranker > black-box models
Ranking quality (up to now): linear < Bucket Ranker < black-box models
What if black-box models are significantly better?
A brilliant idea: use the score from the black-box model as the label
to train a white-box model
16. Engineers on our search ranking team used to manually tune
the model, which sometimes consists of hundreds of curves (i.e.,
piecewise-linear functions)
• Tedious!
• Unproductive! We don't know what the weights/curves should be
Machine learning to the rescue!
17. Outline
Ranking model for graph search
Learn the ranking model for graph search
Training Data
The Workflow to Learn Conditioned Bucket Ranker
Learning To Rank Techniques
Experiments and Discussions
18. Training Data
Results shown to users
Random results/samples from the same search session, but not
shown to users, collected on the indexing servers
19. Basic Labeling
All random samples are labeled as negative data: -1
Results with target actions (e.g., click, friending) are labeled as
positive data: +1
Results without target actions are labeled as negative data: -1
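The basic labeling rule above can be stated as a tiny function. The dict keys used here (`is_random_sample`, `has_target_action`) are hypothetical field names chosen for illustration, not the real log schema:

```python
def basic_label(result):
    """Basic labeling: random samples get -1; results with a target
    action (click, friending, ...) get +1; shown-but-unactioned
    results get -1."""
    if result.get("is_random_sample"):
        return -1
    return 1 if result.get("has_target_action") else -1

print(basic_label({"has_target_action": True}))   # 1
print(basic_label({"is_random_sample": True}))    # -1
print(basic_label({}))                            # -1
```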
20. More Labeling Strategies
typeahead_balanced:
ignore negative results ranked under the positive results
positive_only:
ignore all negative results; only use the random samples as negative data
…
Basic labeling is usually the best, or good enough
21. Outline
Ranking model for graph search
Learn the ranking model for graph search
Training Data
The Workflow to Learn Conditioned Bucket Ranker
Learning To Rank Techniques
Example Results on photo search
22. How to Choose the Conditions and Features
Conditions: manually chosen up to
now, incorporating human domain
knowledge
Features: we usually need to remove
obviously meaningless features like
Doc_id
Some ongoing tasks aim at
suggesting conditions and features
automatically,
e.g., frequent pattern mining on queries
23. Create Bucket Borders
For continuous features, or discrete
features with many possible values:
• Percentile: make sure each bucket contains the
same number of data points
For discrete features with few
possible values (like binary features),
or categorical features such as user
locale:
• each feature value is one bucket
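A sketch of the percentile strategy for continuous features, using only sorting and index arithmetic; `percentile_borders` is an assumed name, and the real pipeline's border selection may differ in detail (e.g., tie handling):

```python
def percentile_borders(values, num_buckets):
    """Choose bucket borders so each bucket holds roughly the same
    number of training values (the 'percentile' strategy)."""
    xs = sorted(values)
    n = len(xs)
    borders = []
    for k in range(num_buckets + 1):
        # pick evenly spaced ranks through the sorted data
        idx = min(n - 1, k * (n - 1) // num_buckets)
        b = xs[idx]
        if not borders or b != borders[-1]:  # drop duplicate borders
            borders.append(b)
    return borders

print(percentile_borders(list(range(101)), 4))  # [0, 25, 50, 75, 100]
```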
25. Learn the Bucket Ranker
We have the conditions, features, and
bucket borders now;
the only thing left to learn is the
weight of each bucket.
27. Feature transformation -- Linearization
s = R(f) = Σ_{i ∈ Conditions} Σ_{j ∈ Features} <w_ij, h_ij(f_j)> = <w, x>
f: original features
x: features after transformation
Dimension of features x:
number of conditions × number of original features × number of buckets
One condition satisfied: x = [0, ..., 0, f'_ij, 0, ..., 0]
Multiple conditions satisfied: x = [0, ..., f'_ij, ..., 0, f'_i'j, ..., 0]
Learning the whole tree simultaneously == learning a linear function s = R(x) = <w, x>
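The transformation can be sketched for a single feature: each bucket becomes one coordinate, set to 1 once the value has crossed that bucket and to the interpolation fraction inside the partial bucket, so the bucket-ranker score is exactly a dot product with the bucket weights. The `linearize` name is an assumption; the borders and weights are the earlier slide's example.

```python
def linearize(x, borders):
    """Map a raw feature value to its bucket indicator vector phi so that
    the piecewise-linear bucket score equals the dot product <w, phi>.
    phi[0] = 1 (base bucket); phi[k] = fraction of bucket k crossed."""
    phi = [1.0] + [0.0] * (len(borders) - 1)
    for k in range(1, len(borders)):
        lo, hi = borders[k - 1], borders[k]
        if x >= hi:
            phi[k] = 1.0                    # bucket fully crossed
        elif x > lo:
            phi[k] = (x - lo) / (hi - lo)   # partial bucket
    return phi

borders = [0, 1, 11, 240]
weights = [0.0206305, 0.0555313, 0.284588, 0]
phi = linearize(30, borders)                      # [1.0, 1.0, 1.0, 19/229]
score = sum(w * p for w, p in zip(weights, phi))  # same as the bucket score
print(score)  # 0.3607498
```

Stacking these per-feature vectors per condition gives the full x, and learning the whole tree reduces to fitting the linear weights w.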
28. Outline
Ranking model for graph search
Learn the ranking model for graph search
Training Data
The Workflow to Learn Conditioned Bucket Ranker
Learning To Rank Techniques
Example Results on photo search
29. Learning to Rank
Given lots of training data (feature_vector xi, score si), i = 1, …, n
Learn a linear ranking function
• score = Ranking_model(feature_vector) = <w, x>
30. Learning to Rank -- History
A good summary: http://en.wikipedia.org/wiki/Learning_to_rank
Three main categories of methods
• Pointwise learning
• Pairwise learning
• Listwise learning
31. Learning to Rank -- History
Pointwise Learning
• For one point (xi, si), the cost function is to make sure R(xi) ≈ si
• i.e., if one result gets clicked, its score should be close to 1; otherwise, its score should be close to 0
• Every supervised regression or classification method is applicable. Examples:
• Linear regression, logistic regression, SVM, etc.
• The FB Ads and Newsfeed ranking teams are using methods in this category
• One possible problem: the label is not session specific
x: feature_vector after transformation
s: score
R: ranking model
32. Learning to Rank -- History
(Session-Specific) Pairwise Learning
• For two points (xi, si), (xj, sj) from the same search session, the cost function is to make sure
R(xi) > R(xj) if si > sj
• In other words, if results i and j come from the same search session, and i is clicked but j is
not, then the score of i should be higher than the score of j.
• Examples: RankNet, Ranking SVM, RankBoost, LambdaRank, CRR, etc.
(Session-Specific) Listwise Learning
• For all points from the same search session, make sure R(xi) has the same order as si
• Structured SVM, LambdaMART, etc.
x: feature_vector after transformation
s: score
R: ranking model
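As a concrete instance of session-specific pairwise learning, here is a toy RankSVM-style SGD loop (an illustration, not the actual sofia-ml implementation): each training pair is a (clicked, unclicked) result from the same session, and a hinge-loss step pushes the clicked result's score above the other's by a margin.

```python
import random

def pairwise_sgd(pairs, dim, lr=0.1, epochs=20, seed=0):
    """Toy RankSVM-style trainer. `pairs` holds (x_pos, x_neg) feature
    vectors from the same session; we learn w so that <w, x_pos> ends
    up above <w, x_neg> by a margin of 1."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(pairs)
        for x_pos, x_neg in pairs:
            diff = [p - n for p, n in zip(x_pos, x_neg)]
            # hinge loss on the score difference
            if sum(wk * dk for wk, dk in zip(w, diff)) < 1.0:
                for k in range(dim):
                    w[k] += lr * diff[k]  # subgradient step
    return w

# Toy data: clicked results have a larger value of feature 0.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([2.0, 0.0], [0.0, 2.0])]
w = pairwise_sgd(pairs, dim=2)
```

After training, w scores the clicked-style vectors above the unclicked ones, which is all the pairwise objective asks for.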
33. CRR: Combined Regression and Ranking
Objective = pointwise term + pairwise term + regularization term
E.g.,
f: linear, l: hinge loss, t: sign()
or
f: logistic function, l: a·log(a) + (1-a)·log(1-a), t: (1+y)/2
34. Learn the (Conditioned) Bucket Ranker
Solve the problem by Stochastic Gradient Descent (SGD) methods
• Machine learning toolbox: sofia-ml
Can train on 10M data points within 1 hour on a single machine
35. Outline
Ranking model for graph search
Learn the ranking model for graph search
Experiments and Discussions
36. Data Logging
Logging of the result features in the backend
Logging of the click & conversion actions in the frontend
• Tables from the frontend and backend are joined to create the
ultimate HIVE data table: search_learning_data
• Tables are populated once per day
37. Train Scoring Models with Machine Learning Pipelines
Run two commands, e.g.,
• search/ranking/util/collect_data_ta.sh '2013-08-21' '2013-08-27' -type users --output /home/jfh/data.txt
• search/ranking/util/train_model.sh /home/jfh/data.txt /home/jfh/bm_test
Wait for 1-2 hours, obtain your model!
Evaluation and A/B test
▪ More details:
https://our.intern.facebook.com/intern/wiki/index.php/Training_models_using_hive_data
38. Results
ML-trained models for typeahead verticals (except group)
are deployed in production
• Metrics are usually slightly better compared to the hand-tuned model
Two trained models for Browse deployed to production
• Ragtime: improved clickers by 2.47% and actioners by 3.77%
• Photos: more details in the following slides
39. Basic metrics of the previous hand-tuned production
model for photo search
(# clicked sessions: 30.8%, # clicks per clicked session: 3.04)
# clicks per session: 0.936
# likes per session: ~0.07
# comments: < 0.01
40. A/B test experiments compared to the previous hand-tuned
production model
Improvement compared to the previous hand-tuned production model:

Target actions       | #clicks per session | #likes per session | #photo queries per day | #photo search users per day
click                | 5%                  | -5%                | 3%                     | ~1%
like                 | 4%                  | >20%               | 0-1%                   | ~0%
click, like, comment | 5%                  | 7%                 | 3%                     | ~1%

▪ Ran A/B tests for the hand-tuned model and 3 machine-trained models
▪ The three machine-trained models have target actions of "click", "like", and "click or
like or comment" respectively
▪ Ran A/B tests for 3 weeks
Deployed as our current production model.
41. Case Study: "photos of mark zuckerberg"
Hand-tuned model vs. machine-trained model
One possible reason: improper feature weights
In the hand-tuned model, the weight of the feature FEATURE_CONSTRAINTS_RATIO
(i.e., how many people are tagged) is not as high as other features, e.g.,
FEATURE_PHOTO_LIKES
42. Case Study: "photos of Girish Kumar's friends"
Hand-tuned model vs. machine-trained model
One possible reason: "counter"-intuitive features
In the hand-tuned model, the older the photo, the lower
the score
In the machine-trained model, old photos get low scores,
but very old photos (e.g., taken >10 years ago, vs. photos taken
today) get the highest score!
43. More analysis on photo age
The photo-age feature behaves differently across query types:
photos of my friends, photos of a user
photos of a product/company, movie, public figure, etc.
photos liked by a user
photos in some place: almost 0 weight
45. More examples
"Photos by National Geographic"
Hand-tuned model vs. machine-trained model
"Photos in Beijing, China"
46. Goal At the End of H2
Train models for all the use cases of all verticals
Continuous training
Cover corner cases well
Offline feature extraction and evaluation
Try different machine learning methods
Make the machine learning and models easy to analyze
and debug