Preparation of a tax audit
with Machine Learning
“Feature Importance” analysis applied
to accounting using XGBoost R package
Meetup Paris Machine Learning Applications Group – Paris – May 13th, 2015
Who am I?
Michaël Benesty
@pommedeterre33 @pommedeterresautee fr.linkedin.com/in/mbenesty
• CPA (Paris): 4 years
• Financial auditor (NYC): 2 years
• Tax law associate @ Taj (Deloitte - Paris) since 2013
• Department TMC (Computerized tax audit)
• Co-author of the XGBoost R package with Tianqi Chen (main author) & Tong He (package maintainer)
WARNING
Everything that will be presented
tonight is exclusively based
on open source software
Please try the same at home
Plan
1. Accounting & tax audit context
2. Machine learning application
3. Gradient boosting theory
Accounting crash course 101 (1/2)
Accounting is a way to record economic operations.
• My company buys €10 worth of potatoes to cook delicious French
fries.
Account number | Account Name | Debit | Credit
601 | Purchase | 10.00 |
512 | Bank | | 10.00
Description: Buy €10 of potatoes from XYZ
Accounting crash course 101 (2/2)
French tax law requires much more information in my accounting:
• Who?
• Name of the potatoes provider
• Account of the potatoes provider
• When?
• When the accounting entry is posted
• Date of the invoice from the potatoes seller
• Payment date
• …
• What?
• Invoice ref
• Item description
• …
• How Much?
• Foreign currency
• …
• …
Tax audit context
Since 2014, companies audited by the French tax administration must provide
their entire accounting records as a CSV / XML file.
Simplified* example:
EcritureDate|CompteNum|CompteLib|PieceDate|EcritureLib|Debit|Credit
20110805|601|Purchase|20110701|Buy potatoes|10|0
20110805|512|Bank|20110701|Buy potatoes|0|10
*: usually there are 18 columns
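Such an extract can be loaded with a few lines of standard Python (a minimal sketch based on the simplified 7-column example above; the real file usually has 18 columns):

```python
import csv
import io

# Simplified pipe-delimited extract, exactly as in the example above.
raw = """EcritureDate|CompteNum|CompteLib|PieceDate|EcritureLib|Debit|Credit
20110805|601|Purchase|20110701|Buy potatoes|10|0
20110805|512|Bank|20110701|Buy potatoes|0|10
"""

# One dict per accounting entry line.
entries = list(csv.DictReader(io.StringIO(raw), delimiter="|"))

# Double-entry sanity check: total debits must equal total credits.
total_debit = sum(float(e["Debit"]) for e in entries)
total_credit = sum(float(e["Credit"]) for e in entries)
assert total_debit == total_credit == 10.0
```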
Example of a trivial apparent anomaly
Article 39 of French tax code states that (simplified):
“For FY 2011, an expense is deductible from P&L 2011 when its
operative event happens in 2011”
In our audit software (ACL), we add a new Boolean feature to
the dataset: True if the invoice date is out of 2011, False
otherwise
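The Boolean feature described above can be sketched in standard Python (for illustration only; the deck's actual tool is ACL, and the fiscal-year boundaries are passed as parameters here):

```python
from datetime import date

def parse_yyyymmdd(s):
    """Parse a YYYYMMDD field, the date format used in the extract."""
    return date(int(s[:4]), int(s[4:6]), int(s[6:8]))

def invoice_out_of_fy(piece_date, fy_start, fy_end):
    """True if the invoice date (PieceDate) falls outside the fiscal year."""
    d = parse_yyyymmdd(piece_date)
    return not (fy_start <= d <= fy_end)

# FY 2011 as a calendar year, matching the simplified Article 39 rule.
fy_start, fy_end = date(2011, 1, 1), date(2011, 12, 31)
assert invoice_out_of_fy("20110701", fy_start, fy_end) is False
assert invoice_out_of_fy("20120315", fy_start, fy_end) is True
```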
Boring tasks a human must perform
Find a pattern that predicts whether an accounting entry will be tagged as an
anomaly, based on the way its fields are populated.
1. Take time to display lines marked as out of FY
demo dataset (1 500 000 lines) ≈ 100 000 lines marked as having an invoice date out of the FY
2. Take time to analyze the 18 columns of the accounting
from 200 to >> 100 000 distinct values per column
3. Take time to find a pattern/rule by hand. Use filters. Iterate.
4. Take time to check that pattern found in selection is not in remaining
data
What can Machine Learning do to help?
1. Look at the whole dataset without human help
2. Analyze each value in each column without human help
3. Find a pattern without human help
4. Generate an (R-Markdown) report without human help
Requirements:
• Interpretable
• Scalable
• Works (almost) out of the box
2 tries for a success
1st try: Subgroup mining (Failed)
Find feature values common to a group of observations which are
different from the rest of the dataset.
2nd try: Feature importance on a decision-tree-based
algorithm (Success)
Use a predictive algorithm to describe the existing data.
1st try: Subgroup mining algorithm
Find feature values common to a group of observations which are different from
the rest of the dataset.
1. Find an existing open source project
2. Check it gives interpretable results in reasonable time
3. Help project main author on:
• reducing the memory footprint by 50% and fixing many small bugs (2 months)
• building the R interface (1 month)
• finding and fixing a huge bug in the core algorithm just before going into production (1 week)
After the last bug fix, the algorithm was too slow to be used on real accounting…
2nd try: XGBoost
Available in R, Python, Julia, and as a CLI
Fast and memory efficient
• Can be more than 10 times faster than GBM in scikit-learn and R (benchmark in the GitHub repository)
• New external-memory learning implementation (based on the distributed computation implementation)
Distributed and Portable
• The distributed version runs on Hadoop (YARN), MPI, SGE etc.
• Scales to billions of examples (tested on 4 billion observations across 20 machines)
XGBoost has won many Kaggle competitions, such as:
• WWW2015 Microsoft Malware Classification Challenge (BIG 2015)
• Tradeshift Text Classification
• HEP meets ML Award in Higgs Boson Challenge
• XGBoost is by far the most discussed tool in ongoing Otto competition
Iterative feature importance with XGBoost (1/3)
Shows which features are the most important to predict if an entry has
its field PieceDate (invoice date) out of the Fiscal Year.
In this example, FY is from 2010/12/01
to 2011/11/30
It is not surprising to have PieceDate among the most important features,
because the label is based on this feature! But the distribution of the
important invoice dates is interesting here.
Most entries out of the FY have the same invoice date: 20111201
Iterative feature importance with XGBoost (2/3)
Since one feature represented > 99% of the gain in the previous slide, we
remove it from the dataset and run a new analysis.
Most entries are related to the same JournalCode (nature of operation).
Iterative feature importance with XGBoost (3/3)
Entries marked as out of FY have the same invoice date, and are related
to the same JournalCode. We run a new analysis without JournalCode:
Most of the entries with an invoice date issue are related to Inventory
accounts! That's the kind of pattern we were looking for.
XGBoost explained in 2 pics (1/2)
Classification And Regression Tree (CART)
A decision tree is a learned set of rules:
if X1 ≤ t1 and X2 ≤ t2 then R1
if X1 ≤ t1 and X2 > t2 then R2
…
Advantages:
• Interpretable
• Robust
• Non linear link
Drawbacks:
• Weak learner
• High variance
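The rule set above maps directly to nested conditionals. A minimal sketch (the thresholds T1, T2 and region labels are invented for illustration):

```python
# Hypothetical split thresholds t1, t2 on two features X1, X2.
T1, T2 = 5.0, 3.0

def cart_predict(x1, x2):
    """Evaluate a depth-2 decision tree: each root-to-leaf path is one
    'if ... then region' rule, exactly as written on the slide."""
    if x1 <= T1:
        return "R1" if x2 <= T2 else "R2"
    return "R3"

assert cart_predict(4.0, 2.0) == "R1"
assert cart_predict(4.0, 9.0) == "R2"
assert cart_predict(9.0, 1.0) == "R3"
```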
XGBoost explained in 2 pics (2/2)
Gradient boosting on CART
• One more tree = lower mean loss = more data explained
• Each tree captures some part of the model
• Tree 1 is fit on the original data points; trees 2 and 3 are fit on the residuals (the loss) left by the previous trees
Learning a model ≃ Minimizing the loss function
Given a prediction $\hat{y}$ and a label $y$, a loss function $\ell$ measures the
discrepancy between the algorithm's prediction and the desired output.
• Loss on training data:
$$L = \sum_{i=1}^{n} \ell(\hat{y}_i, y_i)$$
• Logistic loss for binary classification:
$$\ell(\hat{y}_i, y_i) = -\left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Logistic loss punishes a falsely certain prediction (probability 0 or 1) infinitely hard*
*: $\lim_{x \to 0^+} \log x = -\infty$
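This behaviour is easy to see numerically (a minimal sketch of the per-example loss defined above):

```python
import math

def logistic_loss(y_hat, y):
    """Per-example logistic loss for a predicted probability y_hat
    and a binary label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident, correct prediction costs almost nothing...
assert logistic_loss(0.99, 1) < 0.02
# ...while a confident, wrong prediction is punished very hard,
# diverging to infinity as y_hat approaches 0.
assert logistic_loss(0.0001, 1) > 9
```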
Growing a tree
In practice, we grow the tree greedily:
• Start from tree with depth 0
• For each leaf node of the tree, try to add a split. The change of objective after adding the
split is:
$$\mathit{Gain} = \underbrace{\frac{G_L^2}{H_L + \lambda}}_{\text{score of left child}} + \underbrace{\frac{G_R^2}{H_R + \lambda}}_{\text{score of right child}} - \underbrace{\frac{(G_L + G_R)^2}{H_L + H_R + \lambda}}_{\text{score if we don't split}} - \underbrace{\gamma}_{\substack{\text{complexity cost of an} \\ \text{additional leaf}}}$$
G is the sum of the residuals (gradients) in a node: the general direction of the
residual we want to fit.
H corresponds to the sum of the instance weights in the node.
𝛾 and 𝜆 are 2 regularization parameters.
Tianqi Chen. (Oct. 2014) Learning about the model: Introduction to Boosted Trees
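The gain formula can be checked numerically with a small helper (a sketch; the G, H, λ, γ values are invented for illustration):

```python
def split_gain(g_l, h_l, g_r, h_r, lam, gamma):
    """Change of objective when adding a split: score of the left child,
    plus score of the right child, minus the score of the unsplit node,
    minus the complexity cost gamma of one additional leaf."""
    def score(g, h):
        return g * g / (h + lam)
    return score(g_l, h_l) + score(g_r, h_r) - score(g_l + g_r, h_l + h_r) - gamma

# A split separating residuals of opposite signs has positive gain...
assert split_gain(10.0, 4.0, -10.0, 4.0, lam=1.0, gamma=1.0) > 0
# ...while splitting homogeneous residuals does not pay for the extra leaf.
assert split_gain(5.0, 4.0, 5.0, 4.0, lam=1.0, gamma=1.0) < 0
```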
Gradient Boosting
Iteratively learning weak classifiers with respect to a distribution and
adding them to a final strong classifier.
• Each round we learn a new tree to approximate the negative gradient
and minimize the loss
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
• Loss:
$$Obj^{(t)} = \sum_{i=1}^{n} \ell\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
Here $f_t(x_i)$ is the prediction of tree $t$, $\hat{y}_i^{(t)}$ is the whole model's
prediction, and $\Omega(f_t)$ is the complexity cost of introducing an additional tree.
Friedman, J. H. (March 1999). Stochastic Gradient Boosting.
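The additive update above can be sketched with a toy boosting loop (squared loss assumed, and a single constant value per round stands in for a real CART weak learner):

```python
# Toy gradient boosting on squared loss: each round, the new "tree" f_t
# fits the negative gradient, which for squared loss is the residual y - y_hat.
ys = [3.0, 5.0, 7.0]
preds = [0.0, 0.0, 0.0]
eta = 0.5  # learning rate

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

losses = []
for t in range(20):
    residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
    f_t = sum(residuals) / len(residuals)           # weak learner of round t
    preds = [p + eta * f_t for p in preds]          # y^(t) = y^(t-1) + f_t
    losses.append(mse(ys, preds))

# The training loss never increases from one boosting round to the next.
assert all(a >= b for a, b in zip(losses, losses[1:]))
assert losses[-1] < losses[0]
```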
Gradient descent
“Gradient Boosting is a special case of the functional gradient descent
view of boosting.”
Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space.
(Figure: 2D view of the loss landscape. Sometimes you are lucky; usually you
finish in a local minimum.)
Building a good model for feature importance
For feature importance analysis, in the simplicity vs. accuracy trade-off,
choose simplicity. A few empirical rules of thumb:
• nrounds: number of trees. Keep it low (< 20 trees)
• max.depth: depth of each tree. Keep it low (< 7)
• Run the feature importance analysis iteratively, removing the most
important features, until the 3 most important features represent less
than 70% of the whole gain.
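The iterative procedure above can be sketched as a loop (the gain numbers and `compute_importance` stand-in are invented for illustration; the real analysis would call the R package's training and importance functions each round):

```python
def iterative_importance(features, compute_importance, threshold=0.70):
    """Drop the dominant feature and rerun the analysis until the top 3
    features represent less than `threshold` of the whole gain."""
    removed = []
    while features:
        gains = compute_importance(features)            # {feature: gain}
        total = sum(gains.values())
        ranked = sorted(gains, key=gains.get, reverse=True)
        if sum(gains[f] for f in ranked[:3]) / total < threshold:
            break                                       # importance is spread enough
        removed.append(ranked[0])
        features = [f for f in features if f != ranked[0]]
    return removed

# Toy stand-in for real training + importance; gain values are made up.
RAW_GAIN = {"PieceDate": 0.99, "JournalCode": 0.90, "CompteNum": 0.08,
            "CompteLib": 0.08, "EcritureDate": 0.08, "EcritureLib": 0.08,
            "Debit": 0.08}

def fake_importance(features):
    return {f: RAW_GAIN[f] for f in features}

# The two dominant features are removed, mirroring the PieceDate and
# JournalCode iterations shown earlier.
assert iterative_importance(list(RAW_GAIN), fake_importance) == ["PieceDate", "JournalCode"]
```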
Love XGBoost? Vote XGBoost!
Otto challenge
Help the XGBoost open source project spread knowledge by voting for
our script explaining how to use our tool (no prize to win)
https://www.kaggle.com/users/32300/tianqi-chen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data
Too much time in your life?
• General papers about gradient boosting:
• Greedy Function Approximation: A Gradient Boosting Machine. J.H. Friedman
• Stochastic Gradient Boosting. J.H. Friedman
• Tricks used by XGBoost:
• Additive Logistic Regression: A Statistical View of Boosting. J.H. Friedman, T. Hastie, R. Tibshirani (for the second-order statistics used in tree splitting)
• Learning Nonlinear Functions Using Regularized Greedy Forest. R. Johnson and T. Zhang (proposes the fully corrective step, as well as regularizing the tree complexity)
• Learning about the model: Introduction to Boosted Trees. Tianqi Chen (from the author of XGBoost)
Mais conteúdo relacionado

Mais procurados

『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』
『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』
『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』The Japan DataScientist Society
 
Overview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostOverview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostTakami Sato
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
 
Tokyo r94 beginnerssession3
Tokyo r94 beginnerssession3Tokyo r94 beginnerssession3
Tokyo r94 beginnerssession3kotora_0507
 
『データ解析におけるプライバシー保護』勉強会 #2
『データ解析におけるプライバシー保護』勉強会 #2『データ解析におけるプライバシー保護』勉強会 #2
『データ解析におけるプライバシー保護』勉強会 #2MITSUNARI Shigeo
 
合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点Ichigaku Takigawa
 
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)STAIR Lab, Chiba Institute of Technology
 
連続変量を含む相互情報量の推定
連続変量を含む相互情報量の推定連続変量を含む相互情報量の推定
連続変量を含む相互情報量の推定Joe Suzuki
 
Kaggleのテクニック
KaggleのテクニックKaggleのテクニック
KaggleのテクニックYasunori Ozaki
 
機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)
機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)
機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)Ichigaku Takigawa
 
機械学習モデルのハイパパラメータ最適化
機械学習モデルのハイパパラメータ最適化機械学習モデルのハイパパラメータ最適化
機械学習モデルのハイパパラメータ最適化gree_tech
 
わかりやすいパターン認識 4章
わかりやすいパターン認識 4章わかりやすいパターン認識 4章
わかりやすいパターン認識 4章Motokawa Tetsuya
 
レコメンドエンジン作成コンテストの勝ち方
レコメンドエンジン作成コンテストの勝ち方レコメンドエンジン作成コンテストの勝ち方
レコメンドエンジン作成コンテストの勝ち方Shun Nukui
 
Devsumi 2018summer
Devsumi 2018summerDevsumi 2018summer
Devsumi 2018summerHarada Kei
 
強化学習その3
強化学習その3強化学習その3
強化学習その3nishio
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)Takanori Ogata
 
SGD+α: 確率的勾配降下法の現在と未来
SGD+α: 確率的勾配降下法の現在と未来SGD+α: 確率的勾配降下法の現在と未来
SGD+α: 確率的勾配降下法の現在と未来Hidekazu Oiwa
 
5 クラスタリングと異常検出
5 クラスタリングと異常検出5 クラスタリングと異常検出
5 クラスタリングと異常検出Seiichi Uchida
 
XGBoostからNGBoostまで
XGBoostからNGBoostまでXGBoostからNGBoostまで
XGBoostからNGBoostまでTomoki Yoshida
 

Mais procurados (20)

『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』
『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』
『機械学習による故障予測・異常検知 事例紹介とデータ分析プロジェクト推進ポイント』
 
Overview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboostOverview of tree algorithms from decision tree to xgboost
Overview of tree algorithms from decision tree to xgboost
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Tokyo r94 beginnerssession3
Tokyo r94 beginnerssession3Tokyo r94 beginnerssession3
Tokyo r94 beginnerssession3
 
『データ解析におけるプライバシー保護』勉強会 #2
『データ解析におけるプライバシー保護』勉強会 #2『データ解析におけるプライバシー保護』勉強会 #2
『データ解析におけるプライバシー保護』勉強会 #2
 
合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点
 
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
多腕バンディット問題: 定式化と応用 (第13回ステアラボ人工知能セミナー)
 
連続変量を含む相互情報量の推定
連続変量を含む相互情報量の推定連続変量を含む相互情報量の推定
連続変量を含む相互情報量の推定
 
Kaggleのテクニック
KaggleのテクニックKaggleのテクニック
Kaggleのテクニック
 
機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)
機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)
機械学習~データを予測に変える技術~で化学に挑む! (サイエンスアゴラ2021)
 
機械学習モデルのハイパパラメータ最適化
機械学習モデルのハイパパラメータ最適化機械学習モデルのハイパパラメータ最適化
機械学習モデルのハイパパラメータ最適化
 
わかりやすいパターン認識 4章
わかりやすいパターン認識 4章わかりやすいパターン認識 4章
わかりやすいパターン認識 4章
 
レコメンドエンジン作成コンテストの勝ち方
レコメンドエンジン作成コンテストの勝ち方レコメンドエンジン作成コンテストの勝ち方
レコメンドエンジン作成コンテストの勝ち方
 
Devsumi 2018summer
Devsumi 2018summerDevsumi 2018summer
Devsumi 2018summer
 
強化学習その3
強化学習その3強化学習その3
強化学習その3
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)
 
SGD+α: 確率的勾配降下法の現在と未来
SGD+α: 確率的勾配降下法の現在と未来SGD+α: 確率的勾配降下法の現在と未来
SGD+α: 確率的勾配降下法の現在と未来
 
5 クラスタリングと異常検出
5 クラスタリングと異常検出5 クラスタリングと異常検出
5 クラスタリングと異常検出
 
XGBoostからNGBoostまで
XGBoostからNGBoostまでXGBoostからNGBoostまで
XGBoostからNGBoostまで
 

Semelhante a Feature Importance Analysis with XGBoost in Tax audit

XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ FyberDaniel Hen
 
BTE 320-498 Summer 2017 Take Home Exam (200 poi.docx
BTE 320-498 Summer 2017 Take Home Exam (200 poi.docxBTE 320-498 Summer 2017 Take Home Exam (200 poi.docx
BTE 320-498 Summer 2017 Take Home Exam (200 poi.docxAASTHA76
 
The Role Of Software And Hardware As A Common Part Of The...
The Role Of Software And Hardware As A Common Part Of The...The Role Of Software And Hardware As A Common Part Of The...
The Role Of Software And Hardware As A Common Part Of The...Sheena Crouch
 
Introduction to Artificial Intelligence...pptx
Introduction to Artificial Intelligence...pptxIntroduction to Artificial Intelligence...pptx
Introduction to Artificial Intelligence...pptxMMCOE, Karvenagar, Pune
 
Introduction to Data Structure and algorithm.pptx
Introduction to Data Structure and algorithm.pptxIntroduction to Data Structure and algorithm.pptx
Introduction to Data Structure and algorithm.pptxesuEthopi
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Mathieu DESPRIEE
 
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxWeek 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxmelbruce90096
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloOCTO Technology
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionJaroslaw Szymczak
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZCharles Vestur
 
370_13735_EA221_2010_1__1_1_Linear programming 1.ppt
370_13735_EA221_2010_1__1_1_Linear programming 1.ppt370_13735_EA221_2010_1__1_1_Linear programming 1.ppt
370_13735_EA221_2010_1__1_1_Linear programming 1.pptAbdiMuceeTube
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm AnalysisMary Margarat
 

Semelhante a Feature Importance Analysis with XGBoost in Tax audit (20)

XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
BTE 320-498 Summer 2017 Take Home Exam (200 poi.docx
BTE 320-498 Summer 2017 Take Home Exam (200 poi.docxBTE 320-498 Summer 2017 Take Home Exam (200 poi.docx
BTE 320-498 Summer 2017 Take Home Exam (200 poi.docx
 
Lec1
Lec1Lec1
Lec1
 
Lec1
Lec1Lec1
Lec1
 
Software Sizing
Software SizingSoftware Sizing
Software Sizing
 
193_report (1)
193_report (1)193_report (1)
193_report (1)
 
The Role Of Software And Hardware As A Common Part Of The...
The Role Of Software And Hardware As A Common Part Of The...The Role Of Software And Hardware As A Common Part Of The...
The Role Of Software And Hardware As A Common Part Of The...
 
Introduction to Artificial Intelligence...pptx
Introduction to Artificial Intelligence...pptxIntroduction to Artificial Intelligence...pptx
Introduction to Artificial Intelligence...pptx
 
Lec1
Lec1Lec1
Lec1
 
Introduction to Data Structure and algorithm.pptx
Introduction to Data Structure and algorithm.pptxIntroduction to Data Structure and algorithm.pptx
Introduction to Data Structure and algorithm.pptx
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
 
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxWeek 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to Z
 
1-introduction.ppt
1-introduction.ppt1-introduction.ppt
1-introduction.ppt
 
370_13735_EA221_2010_1__1_1_Linear programming 1.ppt
370_13735_EA221_2010_1__1_1_Linear programming 1.ppt370_13735_EA221_2010_1__1_1_Linear programming 1.ppt
370_13735_EA221_2010_1__1_1_Linear programming 1.ppt
 
lp 2.ppt
lp 2.pptlp 2.ppt
lp 2.ppt
 
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm Analysis
 

Último

Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 

Último (20)

Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 

Feature Importance Analysis with XGBoost in Tax audit

  • 1. Preparation of a tax audit with Machine Learning “Feature Importance” analysis applied to accounting using XGBoost R package Meetup Paris Machine Learning Applications Group – Paris – May 13th, 2015
  • 2. Who am I? Michaël Benesty @pommedeterre33 @pommedeterresautee fr.linkedin.com/in/mbenesty • CPA (Paris): 4 years • Financial auditor (NYC): 2 years • Tax law associate @ Taj (Deloitte - Paris) since 2013 • Department TMC (Computerized tax audit) • Co-author XGBoost R package with Tianqi Chen (main author) & Tong He (package maintainer)
  • 3. WARNING Everything that will be presented tonight is exclusively based on open source software Please try the same at home
  • 4. Plan 1. Accounting & tax audit context 2. Machine learning application 3. Gradient boosting theory
  • 5. Accounting crash course 101 (1/2) Accounting is a way to transcribe economical operations. • My company buys €10 worth of potatoes to cook delicious French fries. Account number Account Name Debit Credit 601 Purchase 10.00 512 Bank 10.00 Description: Buy €10 of potatoes to XYZ
  • 6. Accounting crash course 101 (2/2) French Tax law requires many more information in my accounting: • Who? • Name of the potatoes provider • Account of the potatoes provider • When? • When the accounting entry is posted • Date of the invoice from the potatoes seller • Payment date • … • What? • Invoice ref • Item description • … • How Much? • Foreign currency • … • …
  • 7. Tax audit context Since 2014, companies audited by the French tax administration shall provide their entire accounting as a CSV / XML file. Simplified* example: EcritureDate|CompteNum|CompteLib|PieceDate|EcritureLib|Debit|Credit 20110805|601|Purchase|20110701|Buy potatoes|10|0 20110805|512|Bank|20110701|Buy potatoes|0|10 *: usually there are 18 columns
  • 8. Example of a trivial apparent anomaly Article 39 of French tax code states that (simplified): “For FY 2011, an expense is deductible from P&L 2011 when its operative event happens in 2011” In our audit software (ACL), we add a new Boolean feature to the dataset: True if the invoice date is out of 2011, False otherwise
  • 9. Boring tasks to perform by a human Find a pattern to predict if accounting entry will be tagged as an anomaly regarding the way its fields are populated. 1. Take time to display lines marked as out of FY demo dataset (1 500 000 lines) ≈ 100 000 lines marked having invoice out of FY 2. Take time to analyze 18 columns of the accounting from 200 to >> 100 000 different values per column 3. Take time to find a pattern/rule by hand. Use filters. Iterate. 4. Take time to check that pattern found in selection is not in remaining data
• 10. What can Machine Learning do to help? 1. Look at the whole dataset without human help 2. Analyze each value in each column without human help 3. Find a pattern without human help 4. Generate an (R Markdown) report without human help Requirements: • Interpretable • Scalable • Works (almost) out of the box
• 11. 2 tries for a success 1st try: Subgroup mining (failed) Find feature values common to a group of observations that differ from the rest of the dataset. 2nd try: Feature importance on a decision-tree-based algorithm (success) Use a predictive algorithm to describe the existing data.
• 12. 1st try: Subgroup mining algorithm Find feature values common to a group of observations that differ from the rest of the dataset. 1. Find an existing open source project 2. Check that it gives interpretable results in reasonable time 3. Help the project's main author with: • reducing the memory footprint by 50% and fixing many small bugs (2 months) • an R interface (1 month) • finding and fixing a huge bug in the core algorithm just before going into production (1 week) After the last bug fix, the algorithm was too slow to be used on real accounting…
• 13. 2nd try: XGBoost Available in R, Python, Julia and as a CLI Fast and memory efficient • Can be more than 10 times faster than GBM in scikit-learn and R (benchmark in the GitHub repository) • New external-memory learning implementation (based on the distributed computation implementation) Distributed and portable • The distributed version runs on Hadoop (YARN), MPI, SGE, etc. • Scales to billions of examples (tested on 4 billion observations / 20 computers) XGBoost won many Kaggle competitions, such as: • WWW2015 Microsoft Malware Classification Challenge (BIG 2015) • Tradeshift Text Classification • HEP meets ML Award in the Higgs Boson Challenge • XGBoost is by far the most discussed tool in the ongoing Otto competition
• 14. Iterative feature importance with XGBoost (1/3) Shows which features are the most important to predict whether an entry has its PieceDate field (invoice date) out of the fiscal year. In this example, the FY runs from 2010/12/01 to 2011/11/30. It is not surprising to find PieceDate among the most important features, because the label is based on this feature! But the distribution of the important invoice dates is interesting here: most entries out of the FY have the same invoice date, 20111201
• 15. Iterative feature importance with XGBoost (2/3) Since one feature represented > 99% of the gain in the previous slide, we remove it from the dataset and run a new analysis. Most entries are related to the same JournalCode (nature of operation)
• 16. Iterative feature importance with XGBoost (3/3) Entries marked as out of FY have the same invoice date and are related to the same JournalCode. We run a new analysis without JournalCode: most of the entries with an invoice date issue are related to inventory accounts! That's the kind of pattern we were looking for
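The three analyses above follow a mechanical loop: train a model, read the importance table, drop the dominant feature, run again. A sketch of that loop in Python, where `get_importance` is a hypothetical callback standing in for an actual XGBoost training run plus importance extraction (`xgb.importance` in R, `Booster.get_score(importance_type="gain")` in Python), and the stubbed gain shares merely mimic the talk's demo:

```python
def iterative_importance(get_importance, features, threshold=0.99):
    """Repeat the importance analysis, dropping any feature whose gain
    share dominates the model, until no single feature exceeds threshold."""
    reports = []
    features = list(features)
    while features:
        gains = get_importance(features)   # e.g. train XGBoost + importance
        reports.append(gains)
        top = max(gains, key=gains.get)
        if gains[top] < threshold:
            break                          # no dominant feature left
        features.remove(top)               # re-run without it

    return reports

# Stub mimicking slides 14-16: PieceDate dominates first, then JournalCode,
# then the account column finally carries the interesting pattern.
fake_runs = iter([
    {"PieceDate": 0.995, "JournalCode": 0.004, "CompteNum": 0.001},
    {"JournalCode": 0.992, "CompteNum": 0.008},
    {"CompteNum": 0.60, "EcritureLib": 0.40},
])
reports = iterative_importance(
    lambda feats: next(fake_runs),
    ["PieceDate", "JournalCode", "CompteNum", "EcritureLib"],
)
print(len(reports))  # 3 analyses, as in slides 14-16
```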
• 17. XGBoost explained in 2 pics (1/2) Classification And Regression Tree (CART) A decision tree is about learning a set of rules:
if $X_1 \le t_1$ and $X_2 \le t_2$ then $R_1$
if $X_1 \le t_1$ and $X_2 > t_2$ then $R_2$
…
Advantages: • Interpretable • Robust • Captures non-linear links Drawbacks: • Weak learner • High variance
• 18. XGBoost explained in 2 pics (2/2) Gradient boosting on CART • One more tree = the mean loss decreases = more data explained • Each tree captures some part of the model • The original data points fitted by tree 1 are replaced by the residual (loss) points fitted by trees 2 and 3
• 19. Learning a model ≃ minimizing the loss function Given a prediction $\hat{y}$ and a label $y$, a loss function $\ell$ measures the discrepancy between the algorithm's prediction and the desired output.
• Loss on training data: $L = \sum_{i=1}^{n} \ell(\hat{y}_i, y_i)$
• Logistic loss for binary classification: $\ell(\hat{y}, y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$
Logistic loss punishes with an infinite* cost a false certainty in a prediction close to 0 or 1
*: $\lim_{x \to 0^+} \log x = -\infty$
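As a numeric illustration of the formula above, a minimal Python implementation of the mean logistic loss; the clipping constant `eps` is an assumption added here to avoid evaluating log(0):

```python
import math

def logistic_loss(y, y_hat, eps=1e-15):
    """Mean logistic (cross-entropy) loss, as on the slide.
    Predictions are clipped to (eps, 1-eps) because log(0) diverges."""
    total = 0.0
    for yi, pi in zip(y, y_hat):
        pi = min(max(pi, eps), 1 - eps)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total / len(y)

# A confident correct prediction costs almost nothing...
print(round(logistic_loss([1], [0.99]), 4))    # 0.0101
# ...while a confident wrong one is punished, towards infinity as p -> 0.
print(round(logistic_loss([1], [0.0001]), 2))  # 9.21
```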
• 20. Growing a tree In practice, we grow the tree greedily: • Start from a tree of depth 0 • For each leaf node of the tree, try to add a split. The change of objective after adding the split is:
$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} - \gamma$
i.e. the score of the left child, plus the score of the right child, minus the score if we don't split, minus the complexity cost of introducing an additional leaf. $G$ is the sum of residuals (gradients), which gives the general direction of the residual we want to fit. $H$ corresponds to the sum of the weights of all the instances. $\gamma$ and $\lambda$ are 2 regularization parameters. Tianqi Chen. (Oct. 2014) Learning about the model: Introduction to Boosted Trees
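The gain formula can be checked numerically. A sketch in Python, where the function name `split_gain` and the example gradient/Hessian sums are illustrative assumptions:

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain of a split per the slide's formula:
    score(left) + score(right) - score(no split) - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR) - gamma

# Separating two groups whose residuals point in opposite directions
# gives a positive gain, so the split is worth making.
print(round(split_gain(GL=-4.0, HL=4.0, GR=4.0, HR=4.0, gamma=0.5), 6))  # 5.9
# Splitting a homogeneous node gives a negative gain: don't split.
print(split_gain(GL=1.0, HL=1.0, GR=1.0, HR=1.0) < 0)  # True
```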
• 21. Gradient Boosting Iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier. • Each round we learn a new tree to approximate the negative gradient and minimize the loss: $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$, where $\hat{y}_i^{(t)}$ is the whole-model prediction and $f_t(x_i)$ is the prediction of tree $t$ • Loss: $Obj^{(t)} = \sum_{i=1}^{n} \ell\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$, where $\Omega(f_t)$ is the complexity cost of introducing an additional tree. Friedman, J. H. (March 1999) Stochastic Gradient Boosting.
• 22. Gradient descent “Gradient Boosting is a special case of the functional gradient descent view of boosting.” Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space. [Figure: 2D view of the loss surface. Sometimes you are lucky (you reach the global minimum); usually you finish in a local minimum]
• 23. Building a good model for feature importance For feature importance analysis, in the simplicity vs. accuracy trade-off, choose simplicity. A few empirical rules of thumb: • nrounds: number of trees. Keep it low (< 20 trees) • max.depth: depth of each tree. Keep it low (< 7) • Run the feature importance analysis iteratively and remove the most important features until the 3 most important features represent less than 70% of the whole gain.
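Under these rules of thumb, a training configuration might look like the following sketch. Parameter names follow the xgboost Python API (the talk uses the R package, where the equivalents are `nrounds` and `max.depth`); the commented-out training calls assume the xgboost package is installed:

```python
# Hedged sketch of a simple model tuned for interpretability, not accuracy.
params = {
    "objective": "binary:logistic",  # predict the out-of-FY flag
    "max_depth": 6,                  # keep trees shallow (< 7)
    "eta": 0.3,                      # default learning rate
}
num_boost_round = 15                 # few trees (< 20): favour simplicity

# With xgboost installed, training and importance extraction would be:
#   import xgboost as xgb
#   dtrain = xgb.DMatrix(X, label=y)           # X: one-hot encoded entries
#   model = xgb.train(params, dtrain, num_boost_round)
#   model.get_score(importance_type="gain")    # feature importance table

print(params["max_depth"] < 7 and num_boost_round < 20)  # True
```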
• 24. Love XGBoost? Vote XGBoost! Otto challenge Help the XGBoost open source project spread knowledge by voting for our script explaining how to use our tool (no prize to win) https://www.kaggle.com/users/32300/tianqi-chen/otto-group-product-classification- challenge/understanding-xgboost-model-on-otto-data
• 25. Too much time in your life? • General papers about gradient boosting: • Greedy Function Approximation: A Gradient Boosting Machine. J. H. Friedman • Stochastic Gradient Boosting. J. H. Friedman • Tricks used by XGBoost: • Additive Logistic Regression: A Statistical View of Boosting. J. H. Friedman, T. Hastie, R. Tibshirani (for the second-order statistics for tree splitting) • Learning Nonlinear Functions Using Regularized Greedy Forest. R. Johnson and T. Zhang (proposes a fully corrective step, as well as regularizing the tree complexity) • Learning about the model: Introduction to Boosted Trees. Tianqi Chen (from the author of XGBoost)