Chinese Word Segmentation in MSR-NLP
Andi Wu
Microsoft Research
One Microsoft Way, Redmond, WA 98052
andiwu@microsoft.com
Abstract
Word segmentation in MSR-NLP is an integral part of a sentence analyzer which includes basic segmentation, derivational morphology, named entity recognition, new word identification, word lattice pruning and parsing. The final segmentation is produced from the leaves of parse trees. The output can be customized to meet different segmentation standards through the value combinations of a set of parameters. The system participated in four tracks of the segmentation bakeoff -- PK-open, PK-closed, CTB-open and CTB-closed -- and ranked #1, #2, #2 and #3 respectively in those tracks. Analysis of the results shows that each component of the system contributed to the scores.
1 System Description
The MSR-NLP Chinese system that participated in the current segmentation bakeoff is not a stand-alone word segmenter. It is a Chinese sentence analyzer where the leaves of parse trees are displayed as the output of word segmentation. The components of this system are described below.
1.1 Basic segmentation
Each input sentence is first segmented into individual characters.[1] These characters and their combinations are then looked up in a dictionary,[2] and a word lattice containing lexicalized words only is formed. Each node in the lattice is a feature matrix that contains the part of speech and other grammatical attributes. Multiple-character words may also have information for resolving segmentation ambiguities. In general, multiple-character words are assigned higher scores than the words they subsume, but words like “才能” are exceptions, and such exceptional cases are usually marked in the dictionary. For some of the words that tend to overlap with other words, there is also information as to what the preferred segmentation is. For instance, the preferred segmentation for “会议员” is “会+议员” rather than “会议+员”. Such information was collected from segmented corpora and stored in the dictionary. The scores are later used in word lattice pruning and parse ranking (Wu and Jiang 1998).

[1] If an input line contains more than one sentence, a sentence separator is applied to break the line into individual sentences, which are then processed one by one and the results are concatenated to form a single output.

[2] The lookup is optimized so that not all possible combinations are tried.
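To make the lookup concrete, here is a minimal sketch of lattice construction by dictionary lookup. The dictionary contents, feature names and the span bound are invented for illustration; they are not the system's actual data structures.

```python
# Minimal sketch (not the MSR-NLP implementation) of building a word
# lattice by looking up character spans in a dictionary.
DICT = {
    "会":   {"pos": "V", "score": 1.0},
    "议":   {"pos": "N", "score": 1.0},
    "员":   {"pos": "N", "score": 1.0},
    "会议": {"pos": "N", "score": 2.0},
    "议员": {"pos": "N", "score": 2.0},
}

def build_lattice(sentence, dictionary, max_len=4):
    """Return lattice edges (start, end, word, features) for every
    dictionary word found in the sentence."""
    edges = []
    for i in range(len(sentence)):
        # bounded lookup: only spans up to max_len characters are tried,
        # mirroring the optimization that not all combinations are tried
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            span = sentence[i:j]
            if span in dictionary:
                edges.append((i, j, span, dictionary[span]))
    return edges

for edge in build_lattice("会议员", DICT):
    print(edge)   # five edges: 会, 议, 员, 会议, 议员
```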
1.2 Derivational morphology and named entity recognition
After basic segmentation, a set of augmented phrase structure rules is applied to the word lattice to form larger word units, which include:

• Words derived from morphological processes such as reduplication, affixation, compounding, merging, splitting, etc.

• Named entities such as person names, place names, company names, product names, numbers, dates, monetary units, etc.

Each of these units is a tree that reflects the history of rule application. They are added to the existing word lattice as single nodes and treated as single words by the parser. The internal structures are useful for various purposes, one of which is the customization of word segmentation: words with such structures can all be displayed as single words or multiple words depending on where the “cuts” are made in the word tree (Wu 2003).
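As an illustration of how a derived unit enters the lattice, the following sketch builds a full-name node whose internal structure records the rule application (here a rule like FullName -> FamilyName GivenName). The WordTree class and its field names are hypothetical.

```python
# Hypothetical word-tree sketch: one lattice node with internal structure.
class WordTree:
    def __init__(self, label, daughters=None, surface=None):
        self.label = label                    # e.g. "FullName"
        self.daughters = daughters or []
        self.surface = surface or "".join(d.surface for d in self.daughters)

family = WordTree("FamilyName", surface="王")
given = WordTree("GivenName", surface="爱民")
full_name = WordTree("FullName", [family, given])
# added to the lattice as a single node: the parser sees one word, but
# the daughters remain available for display-time segmentation (see 1.6)
print(full_name.surface)  # 王爱民
```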
1.3 New word identification
The expanded word lattice built in 1.2 is inspected to detect spots of possible OOV new words. Typical spots of this kind are sequences of single characters that are not subsumed by longer words. We then use the following information to propose new words (Wu and Jiang, 2000); a sketch of how such signals might be combined follows the list:

• The probability of the character string being a sequence of independent words;

• The morphological and syntactic properties of the characters;

• Word formation rules;

• Behavior of each character in existing words (e.g. how likely this character is to be used as the second character of a two-character verb);

• The context in which the characters appear.
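The sketch below combines these kinds of evidence into an invented candidate score. The feature functions, their combination and the omission of explicit word formation rules are all simplifications; the actual scoring used in the system is not specified here.

```python
# Invented scoring sketch for a new-word candidate.
def new_word_score(chars, iwp, char_position_prob, context_score):
    """chars: candidate string.
    iwp(s): probability that s is a sequence of independent words.
    char_position_prob(c, i, n): how readily character c occupies
    position i of an n-character word, estimated from existing words.
    context_score: a number in [0, 1] rating the local context."""
    positional = 1.0
    for i, c in enumerate(chars):
        positional *= char_position_prob(c, i, len(chars))
    # a string that is very likely a run of independent words is a poor
    # new-word candidate, hence the (1 - iwp) factor
    return (1.0 - iwp(chars)) * positional * context_score
```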
The proposed new words are added to the word lattice, and they will get used if no successful parse can be obtained without them. When a new word proposed this way has been verified by the parser (i.e. used in a successful parse) more than n times, it will automatically become an entry in the dictionary. From then on, this word can be looked up directly from the dictionary instead of being proposed online. This kind of dynamic lexical acquisition is presented in Wu et al. (2002).
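A sketch of the promotion mechanism, assuming a simple counter per proposed word; the threshold n and the data structures are illustrative.

```python
# Dynamic lexical acquisition sketch: promote a proposed word to the
# dictionary after the parser has verified it more than n times.
from collections import Counter

VERIFY_THRESHOLD = 3          # the paper's "n"; actual value unspecified
verified_counts = Counter()
dictionary = set()

def record_verified_parse(new_word):
    """Call whenever a proposed word is used in a successful parse."""
    verified_counts[new_word] += 1
    if verified_counts[new_word] > VERIFY_THRESHOLD:
        dictionary.add(new_word)   # looked up directly from now on

for _ in range(5):
    record_verified_parse("反腐")
print("反腐" in dictionary)  # True
```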
1.4 Word lattice pruning
Now that all the possible words are in the word lattice, both statistical and linguistic methods are applied to eliminate certain paths. For instance, those paths that contain one or more bound morphemes are pruned away. Single characters that are subsumed by longer words are also thrown out if their independent word probabilities are very low. The result is a much smaller lattice that resembles the n-best paths produced by a statistical word segmenter. Because the final resolution of ambiguities is expected to be done during parsing, the lattice pruning is non-greedy so that no plausible path will be excluded prematurely. Many of the ambiguities that are eliminated here can also be resolved by the parser, but the pruning greatly reduces the complexity of the parsing process, making the parser much faster and more accurate.
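The two pruning heuristics above can be sketched as follows. The feature names ("bound_morpheme", "iwp") and the threshold are assumptions made for the sketch; edges are (start, end, word, features) tuples as in the lookup sketch in 1.1.

```python
# Sketch of lattice pruning: drop bound morphemes and low-probability
# single characters that are covered by longer words.
IWP_THRESHOLD = 0.05  # hypothetical cutoff for independent word probability

def prune_lattice(edges):
    kept = []
    for start, end, word, feats in edges:
        if feats.get("bound_morpheme"):
            continue  # paths through bound morphemes are pruned away
        subsumed = any(s <= start and end <= e and (e - s) > (end - start)
                       for s, e, _, _ in edges)
        if end - start == 1 and subsumed and feats.get("iwp", 1.0) < IWP_THRESHOLD:
            continue  # a low-IWP single character covered by a longer word
        kept.append((start, end, word, feats))
    return kept
```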
1.5 Parsing
The cleaned-up word lattice is then submitted to the parser as the initial entries in the parsing chart. With the assumption that a successful parse of the sentence requires a correct segmentation of the sentence, many segmentation ambiguities are expected to be resolved here. This assumption does not always hold, of course. A sentence can often be parsed in multiple ways, and the top-ranking parse is not always the correct one. There are also sentences that are not covered by the grammar and therefore cannot be parsed at all. In this latter case, we back off to partial parsing and use dynamic programming to assemble a tree that consists of the largest sub-trees in the chart.

In most cases, the use of the parser results in better segmentation, but the parser can also mislead us. One of the problems is that the parser treats every input as a sentence and tries to construct an S out of it. As a result, even a name like “王爱民” can be analyzed as a sentence with 王 as the subject, 爱 as the verb and 民 as the object, if it appears in the wrong context (or no context).
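The backoff step can be viewed as a minimal-count cover problem: choose the fewest (hence largest) chart subtrees that span the sentence, via dynamic programming. The (start, end, tree) chart encoding below is our assumption, not the system's actual chart format.

```python
# Sketch of the partial-parsing backoff over a hypothetical chart.
def assemble_cover(n, subtrees):
    """n: sentence length; subtrees: list of (start, end, tree).
    Returns a minimal-count left-to-right cover of 0..n, or None."""
    best = [None] * (n + 1)       # best[i] = (count, trees) covering 0..i
    best[0] = (0, [])
    for end in range(1, n + 1):
        for s, e, tree in subtrees:
            if e == end and best[s] is not None:
                cand = (best[s][0] + 1, best[s][1] + [tree])
                if best[end] is None or cand[0] < best[end][0]:
                    best[end] = cand
    return best[n][1] if best[n] else None

print(assemble_cover(5, [(0, 2, "NP1"), (2, 5, "VP1"), (0, 3, "NP2")]))
# ['NP1', 'VP1'] -- two large subtrees assembled into a single cover
```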
1.6 Segmentation parameters
Due to the differences in segmentation standards, the leaves of a parse tree do not always correspond to the words in a particular standard. For instance, a Chinese full name is a single leaf in our trees, but it is supposed to be two words (family name + given name) according to the PK standard. Fortunately, most of the words whose segmentation is controversial are built dynamically in our system with their internal structures preserved. A Chinese full name, for example, is a word tree where the top node dominates two nodes: the family name and the given name. Each non-terminal node in a word tree as described in 1.2 is associated with a parameter whose value determines whether the daughters of this node are to be displayed as a single word or multiple words. Since all the dynamic words are built by phrase structure rules and their word trees reflect the derivational history of rule application, there is a one-to-one correspondence between the types of words and the word-internal structures of those words. A segmentation parameter is associated with each type of word,[3] and the value of this parameter determines how the given type of word should be segmented. This makes it possible for the system to quickly adapt to different standards (Wu 2003).
[3] There are about 50 parameters in our system.
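Reusing the hypothetical WordTree sketch from 1.2, display-time segmentation under parameter control might look like the following. The parameter names and values ("keep"/"split") are illustrative stand-ins for the roughly 50 parameters in the system.

```python
# Display-time segmentation controlled by per-node-type parameters.
PARAMS = {"FullName": "split"}   # e.g. the PK standard splits full names

def display(tree, params):
    """Yield surface words for a word tree under the given parameters."""
    if not tree.daughters or params.get(tree.label, "keep") == "keep":
        yield tree.surface
    else:
        for daughter in tree.daughters:
            yield from display(daughter, params)

print(list(display(full_name, PARAMS)))  # ['王', '爱民'] under PK
```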
1.7 Speed
Our system is not optimized for speed in word segmentation. As we have seen, the system is a sentence analyzer, and word segmentation is just a by-product of parsing. The speed we report here is in fact the speed of parsing.

On a single 997 MHz Pentium III machine, the system is able to process 28,740 characters per minute. The speed may vary according to sentence length: given texts of the same size, those containing longer sentences will take more time. The number reported here is an average over the test sets of the four tracks we participated in.

We have the option of turning off the parser during word segmentation. When the parser is turned off, segmentation is produced directly from the word lattice with dynamic programming, which selects the shortest path. The speed in this case is about 60,000 characters per minute.
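A sketch of this parser-off mode follows, assuming "shortest path" means the path with the fewest words and that single characters are always available as fallback edges; the dictionary is illustrative.

```python
# Shortest-path (fewest-words) segmentation over the lattice.
def shortest_path_segment(sentence, dictionary, max_len=4):
    n = len(sentence)
    best = [None] * (n + 1)           # best[i] = (word_count, words)
    best[0] = (0, [])
    for i in range(n):
        if best[i] is None:
            continue
        for j in range(i + 1, min(i + max_len, n) + 1):
            word = sentence[i:j]
            # single characters always pass, so a path always exists
            if j - i == 1 or word in dictionary:
                cand = (best[i][0] + 1, best[i][1] + [word])
                if best[j] is None or cand[0] < best[j][0]:
                    best[j] = cand
    return best[n][1]

print(shortest_path_segment("会议员", {"会议", "议员"}))
# ['会', '议员'] -- both two-word paths tie; the first one found wins
```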
2 Evaluation
We participated in the four GB tracks in the first international Chinese word segmentation bakeoff -- PK-open, PK-closed, CTB-open and CTB-closed -- and ranked #1, #2, #2, and #3 respectively in those tracks. In what follows, we discuss how we got these results: what dictionaries we used, how we used the training data, how much each component contributed to the scores, and the problems that affected our performance.
2.1 Dictionaries
In the open tracks, we used our proprietary dictionary of 89,845 entries, including entries for 7,017 single characters. In the closed tracks, we removed from the dictionary all the words that did not appear in the training data, but kept all the single characters. This resulted in a dictionary of 34,681 entries in the PK track and 18,207 entries in the CTB track. It should be noted that not all the words in the training data are in our dictionary. This explains why the total numbers of entries in those reduced dictionaries are smaller than the vocabulary sizes of the respective training sets even with all the single-character entries included in them.
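The filtering step can be sketched in a few lines (variable names are illustrative): multi-character entries unseen in the training data are dropped, while all single-character entries are kept.

```python
# Closed-track dictionary construction sketch.
def closed_track_dictionary(full_dict, training_vocab):
    """full_dict: word -> grammatical features; training_vocab: set of
    words occurring in the training data."""
    return {word: feats for word, feats in full_dict.items()
            if len(word) == 1 or word in training_vocab}
```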
The dictionary we use in each case is not a simple word list. Every word has one or more parts of speech and a number of other grammatical features. No word can be used by the parser unless it has those features. This made it very difficult for us to add all the words in the training data to the dictionary. We did use a semi-automatic process to add as many words as possible, but both the accuracy and coverage of the added grammatical features are questionable due to the lack of manual verification.
2.2 Use of the training data
We used the training data mainly to tune the segmentation parameters of our system. As mentioned in 1.6, there are about 50 types of morphologically derived words that are built online in our system, and each type has a parameter to determine whether a given unit should be displayed as a single word or separate words. Since our default segmentation is very different from PK or CTB, and PK and CTB also follow different guidelines, we had to try different value combinations of the parameters in each case until we got the optimal settings.

The main problem in the tuning is that many morphologically derived words have been lexicalized in our dictionary and therefore do not have the word-internal structures that they would have if they had been constructed dynamically. As a result, their segmentation is beyond the control of those parameters. To solve this problem, we used the training data to automatically identify all such cases, create a word-internal structure for each of them, and store the word tree in their lexical entries.[4] This made it possible for the parameter values to apply to both the lexicalized and non-lexicalized words. This process can be fairly automatic if the annotation of the training data is completely consistent. However, as we have discovered, the training data is not as consistent as expected, which made total automation impossible.

[4] The work is incomplete, since the trees were created only for those words that are in the training data provided.
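A rough sketch of such tuning is given below. With about 50 parameters, exhaustively trying all value combinations is infeasible, so this sketch greedily tunes one parameter at a time against a scoring function; that coordinate-wise simplification is for illustration only and is not necessarily how the tuning was actually done.

```python
# Greedy parameter tuning sketch against the training data.
def tune_parameters(param_names, segment_fn, score_fn, training_text):
    """segment_fn(text, params) -> segmentation; score_fn(segmentation)
    -> F-measure against the gold standard. Both are assumed callables."""
    params = {p: "keep" for p in param_names}        # arbitrary start
    for p in param_names:
        best_value, best_score = None, -1.0
        for value in ("keep", "split"):              # illustrative values
            params[p] = value
            score = score_fn(segment_fn(training_text, params))
            if score > best_score:
                best_value, best_score = value, score
        params[p] = best_value
    return params
```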
2.3 Contribution of each component
After we received our individual scores and the reference testing data, we did some ablation experiments to find out the contribution of each system component in this competition. We turned off the components one at a time (except basic segmentation) and recorded the scores of each ablated system. The results are summarized in the following table, where “DM-NER” stands for “derivational morphology and named entity recognition”, “NW-ID” for “new word identification and lexicalization”, “Pruning” for “lattice pruning” and “Tuning” for “tuning of parameter values”. Each cell in the table gives two percentages: the F-measure followed by the OOV word recall rate.
                   PK-open       PK-closed     CTB-open      CTB-closed
Complete System    95.9 / 79.9   94.7 / 68.0   90.1 / 73.8   83.1 / 43.1
Without DM-NER     90.2 / 44.4   88.9 / 33.9   86.6 / 66.6   79.2 / 33.5
Without NW-ID      95.8 / 77.3   94.0 / 61.2   88.7 / 69.0   79.2 / 28.2
Without Pruning    92.0 / 77.5   90.9 / 65.9   85.5 / 69.0   78.8 / 39.5
Without Parsing    95.5 / 79.9   94.4 / 68.5   89.8 / 75.0   84.0 / 48.1
Without Tuning     84.8 / 43.4   83.9 / 33.3   84.8 / 72.3   78.4 / 43.3

(each cell: F-measure % / OOV word recall %)
Several interesting facts are revealed in this breakdown:

• The tuning of parameter values has the biggest impact on the scores across the board.

• Derivational morphology and NE recognition is also a main contributor, especially in the PK sets, which presumably contain more named entities.

• The impact of new word identification is minimal when the OOV word rate is low, as in the PK-open case, but becomes more and more significant as the OOV rate increases.

• Lattice pruning makes a big difference as well. Apparently it cannot be replaced by the parser in terms of the disambiguating function it performs. Another fact, not represented in the table, is that parsing is three times slower when lattice pruning is turned off.

• The parser has very limited impact on the scores. Looking at the data, we find that parsing did help to resolve some of the most difficult cases of ambiguity, and we would not be able to get the last few points without it. But it seems that most of the common problems can be solved without the parser. In one case (CTB-closed), the score is higher when the parser is turned off. This is because the parser may prefer a structure where dynamically recognized OOV words are broken up into smaller units. For practical purposes, therefore, we may choose to leave out the parser.
2.4 Problems that affected our performance
The main problem is the definition of new words. While our system is fairly aggressive in recognizing new words, both PK and CTB are quite conservative in this respect. Expressions such as “援藏”, “反腐”, “耳鼻喉” and “屡禁不绝” are considered single words in our system but not in PK or CTB. This made our new word recognition do more harm than good in many cases, though the overall impact is positive. Consistency in the annotated corpora is another problem, but this affects every participant. We also had a technical problem where some sentences remained unsegmented simply because some characters are not in our dictionary.
References
Wu, Andi. 2003. Customizable segmentation of morphologically derived words in Chinese. To appear in Computational Linguistics and Chinese Language Processing, 8(2).

Wu, Andi, J. Pentheroudakis and Z. Jiang. 2002. Dynamic lexical acquisition in Chinese sentence analysis. In Proceedings of the 19th International Conference on Computational Linguistics, pp. 1308-1312, Taipei, Taiwan.

Wu, Andi and Z. Jiang. 2000. Statistically-enhanced new word identification in a rule-based Chinese system. In Proceedings of the 2nd Chinese Language Processing Workshop, pp. 46-51, HKUST, Hong Kong.

Wu, Andi and Z. Jiang. 1998. Word segmentation in sentence analysis. In Proceedings of the 1998 International Conference on Chinese Information Processing, pp. 169-180, Beijing, China.
