What is word segmentation

  • Introduction to Chinese Word Segmentation
    • What is word segmentation
    • What are the word segmentation algorithms
    • What is a good word segmentation algorithm
  • Based on matching rules
    • Forward-max matching
    • Backward-max matching
    • Bi-direction Matching
  • Based on probability statistics
    • Language model
    • HMM/CRF

Let's start with a story

The poem line 日照香炉生紫烟 («sunlight on Incense Burner Peak gives rise to purple smoke») can be segmented in two different ways:

日 / 照香炉 / 生 / 紫烟 (the sun / shines on the incense burner / producing / purple smoke)
日照 / 香炉 / 生 / 紫烟 (sunlight / incense burner / gives rise to / purple smoke)


Let's learn about word segmentation algorithms together.

Introduction to Chinese Word Segmentation

What is word segmentation

Borrowing the definition from Baidu Encyclopedia: word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications.

What are the word segmentation algorithms

Word segmentation methods can be roughly divided into two categories: rule-based segmentation and statistics-based segmentation.

  • Rule-based word segmentation
    • Forward maximum matching
    • Backward maximum matching
    • Minimal segmentation (minimize the number of words in each segmented sentence)
    • Bi-directional maximum matching
  • Statistics-based word segmentation
    • Language model
    • HMM
    • CRF
    • Deep learning

What is a good word segmentation algorithm

This touches on the design principles of word segmentation algorithms:

  • The larger the word granularity, the better

  • The fewer out-of-dictionary words in the result, the better, and the fewer single-character dictionary words, the better

  • The fewer words in the result overall, the better

Based on matching rules

Forward-max matching

Example sentence: 日照香炉生紫烟
Dictionary: ["日", "日照", "香炉", "照香炉", "生", "紫烟"]

Suppose the maximum word length is set to 5. Let's see how forward maximum matching proceeds.

Round 1:

«日照香炉生»: no match in the dictionary

«日照香炉»: no match

«日照香»: no match

«日照»: match

Round 2:

«香炉生紫烟»: no match

«香炉生紫»: no match

«香炉生»: no match

«香炉»: match

Round 3:

«生紫烟»: no match

«生紫»: no match

«生»: match

Round 4:

«紫烟»: match

The final segmentation result is: 日照 / 香炉 / 生 / 紫烟

Code:

# Forward maximum matching
def forward_max_matching(text, maxlen, vocab):
    results = []
    while text:
        # Take at most maxlen characters from the front as the candidate
        subtext = text[:maxlen]
        while subtext:
            if subtext in vocab:
                results.append(subtext)
                text = text[len(subtext):]
                break
            # Not in the dictionary: drop the last character and try again
            subtext = subtext[:-1]
        else:
            # No candidate matched at all: emit the single character to avoid an infinite loop
            results.append(text[0])
            text = text[1:]
    return results
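
For example, running forward_max_matching on the sentence and dictionary above gives:

vocab = ["日", "日照", "香炉", "照香炉", "生", "紫烟"]
print(forward_max_matching("日照香炉生紫烟", 5, vocab))
# ['日照', '香炉', '生', '紫烟']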

Backward-max matching

Example sentence: 日照香炉生紫烟
Dictionary: ["日", "日照", "香炉", "照香炉", "生", "紫烟"]

Again suppose the maximum word length is 5. Let's see how backward maximum matching proceeds.

Round 1:

«香炉生紫烟»: no match

«炉生紫烟»: no match

«生紫烟»: no match

«紫烟»: match

Round 2:

«日照香炉生»: no match

«照香炉生»: no match

«香炉生»: no match

«炉生»: no match

«生»: match

Round 3:

«日照香炉»: no match

«照香炉»: match

Round 4:

«日»: match

The final segmentation result is: 日 / 照香炉 / 生 / 紫烟

Notice that the two methods produce different segmentation results!

Code:

# Backward maximum matching
def backward_max_matching(text, maxlen, vocab):
    results = []
    while text:
        # Take at most maxlen characters from the end as the candidate
        subtext = text[-maxlen:]
        while subtext:
            if subtext in vocab:
                results.append(subtext)
                text = text[:-len(subtext)]
                break
            # Not in the dictionary: drop the first character and try again
            subtext = subtext[1:]
        else:
            # No candidate matched at all: emit the last character to avoid an infinite loop
            results.append(text[-1])
            text = text[:-1]
    # Words were collected from right to left, so reverse them
    return results[::-1]
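
Running backward_max_matching on the same sentence and dictionary gives the other segmentation:

vocab = ["日", "日照", "香炉", "照香炉", "生", "紫烟"]
print(backward_max_matching("日照香炉生紫烟", 5, vocab))
# ['日', '照香炉', '生', '紫烟']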

Bi-direction Matching

Bi-directional matching compares the results of forward maximum matching and backward maximum matching and picks the better segmentation.

Algorithm flow:

  1. Compute both the forward maximum matching result and the backward maximum matching result
  2. If the two results contain different numbers of words, return the one with fewer words
  3. If they contain the same number of words:
    • If the segmentations are identical, return either one
    • Otherwise, return the one with fewer single-character words

# Bi-directional maximum matching
def bidirection_matching(text, maxlen, vocab):
    forward = forward_max_matching(text, maxlen, vocab)
    backward = backward_max_matching(text, maxlen, vocab)
    # Different word counts: return the segmentation with fewer words
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    # Same word count and identical segmentation: return either one
    if forward == backward:
        return forward
    # Same word count but different segmentations:
    # return the one with fewer single-character words
    for_single = [word for word in forward if len(word) == 1]
    back_single = [word for word in backward if len(word) == 1]
    return forward if len(for_single) < len(back_single) else backward
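
On the running example, both candidates contain four words, but the forward result has only one single-character word («生») while the backward result has two («日» and «生»), so bi-directional matching returns the forward result:

vocab = ["日", "日照", "香炉", "照香炉", "生", "紫烟"]
print(bidirection_matching("日照香炉生紫烟", 5, vocab))
# ['日照', '香炉', '生', '紫烟']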

Intuitively, how do we get a good segmentation result?

Input text --> generate all possible segmentations --> choose the best one

Clearly, the rule-based matching methods above only yield locally optimal results. More importantly, they ignore the semantics of the sentence. Choosing the best among all possible segmentations requires a language model.

Based on probability statistics

Language model

A language model is a model for computing the probability of a sentence, that is, for judging how likely a sentence is to be natural human language.

Given a sentence

\[ S = W_1, W_2, \dots, W_K \]

its probability can be expressed as

\[ p(S) = p(W_1, W_2, \dots, W_K) \]

According to the Markov assumption, each word can be made to depend only on a limited number of preceding words. Here we go one step further and assume the words are mutually independent, i.e. a unigram language model, so the probability becomes

\[ p(S) = p(W_1, W_2, \dots, W_K) = p(W_1)\, p(W_2) \cdots p(W_K) \]

We know that, relative to the entire corpus, any single word has a very low probability. Multiplying many such small probabilities quickly underflows toward zero, so we take logarithms and turn the product into a sum of log-probabilities; the candidate segmentation with the largest sum is the best one.
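
As a minimal sketch, comparing two candidate segmentations under a unigram model looks like this; the probabilities below are made-up toy values for illustration, not estimates from a real corpus:

import math

# Toy unigram probabilities (hypothetical values for illustration only)
word_prob = {"日": 0.004, "日照": 0.0002, "香炉": 0.0001,
             "照香炉": 0.000001, "生": 0.003, "紫烟": 0.00005}

def unigram_log_prob(words):
    # Sum of log-probabilities; the larger the sum, the better the segmentation
    return sum(math.log(word_prob[w]) for w in words)

print(unigram_log_prob(["日照", "香炉", "生", "紫烟"]))   # about -33.4
print(unigram_log_prob(["日", "照香炉", "生", "紫烟"]))   # about -35.0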

There is another problem: as noted above, generating every possible segmentation of the input and then scoring each one is too inefficient. We need a way to combine generating segmentations with computing their probabilities, and a word graph (lattice) gives us exactly that.

Picture a simple word-graph schematic: each edge represents a character or word, and p denotes the probability of that word in the dictionary. With a unigram language model, all we need is the path whose probability product is largest; taking negative logarithms turns this into a shortest-path problem that can be solved with dynamic programming.
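
Here is a minimal dynamic-programming sketch of that idea. It assumes a toy word_prob dictionary mapping words to unigram probabilities (as in the sketch above) and falls back to a tiny probability for characters not in the dictionary; it is an illustration, not a production segmenter:

import math

def dp_segment(text, word_prob, maxlen=5):
    # best[i]: smallest total negative log-probability for segmenting text[:i]
    # back[i]: start index of the last word in that best segmentation
    n = len(text)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            word = text[j:i]
            if word in word_prob:
                cost = best[j] - math.log(word_prob[word])
                if cost < best[i]:
                    best[i], back[i] = cost, j
        if best[i] == float("inf"):
            # No dictionary word ends here: treat the single character as an
            # out-of-vocabulary word with a tiny probability
            best[i] = best[i - 1] - math.log(1e-8)
            back[i] = i - 1
    # Walk back through the word graph to recover the best path
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

word_prob = {"日": 0.004, "日照": 0.0002, "香炉": 0.0001,
             "照香炉": 0.000001, "生": 0.003, "紫烟": 0.00005}
print(dp_segment("日照香炉生紫烟", word_prob))   # ['日照', '香炉', '生', '紫烟']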

HMM/CRF

Word segmentation can also be framed as sequence labeling: each character in the sentence is assigned one of four tags:

B (beginning of a word), M (middle of a word), E (end of a word), S (a single-character word)
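
For instance, here is a small sketch of how a B/M/E/S label sequence maps back to a segmentation; the tag sequence below is written by hand for illustration, not produced by an actual HMM or CRF model:

def bmes_to_words(chars, tags):
    # Rebuild words from per-character B/M/E/S labels
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # a single-character word
            words.append(ch)
        elif tag == "B":          # beginning of a multi-character word
            buf = ch
        elif tag == "M":          # middle of a word
            buf += ch
        else:                     # "E": end of a word
            words.append(buf + ch)
            buf = ""
    return words

print(bmes_to_words("日照香炉生紫烟", ["B", "E", "B", "E", "S", "B", "E"]))
# ['日照', '香炉', '生', '紫烟']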

That is the overall picture. In practice the choice depends on the scenario: for example, when a search engine segments large-scale web content, speed matters more than accuracy, whereas in scenarios such as analysing user queries, accuracy matters more than speed.

References

Matching methods: https://blog.csdn.net/selinda001/article/details/79345072

Overview of Chinese word segmentation: https://zhuanlan.zhihu.com/p/67185497

Overview of word segmentation algorithms: https://zhuanlan.zhihu.com/p/50444885


What is tokenization?

Tokenization involves breaking text into individual words, making it easier for computers to understand and analyze meaning. This task applies to various Natural Language Processing (NLP) applications such as language translation, text summarization, and sentiment analysis.

In this post, we will explore using SentencePiece, a widely used open-source library for tokenization in Python.

Setup

Before diving into the implementation, we have to download the IMDB Movie Reviews Dataset, which can be found here.

Once the dataset is downloaded, consolidate the reviews into a single file, named reviews.txt, with each review on a separate line.

Unsupervised segmentation

SentencePiece

SentencePiece is an open-source library that allows for unsupervised tokenization.

Unlike traditional tokenization methods, SentencePiece is reversible. It can reconstruct the original text given a dictionary of known words and a sequence of token IDs. This means there is no information loss.

SentencePiece is also grammar-agnostic, meaning it can learn a segmentation model from raw bytes. This makes it suitable for many different text types, including logographic languages like Chinese. SentencePiece eliminates the need for large language models, and updating new terms is a breeze.

To use SentencePiece for tokenization in Python, you must first import the necessary modules. If you do not have sentencepiece installed, use pip install sentencepiece.

import os
import sentencepiece as spm

Once you have the necessary modules imported, you can use SentencePiece to train a model on your text data. The following code will train a model on the “reviews.txt” file and save it as “reviews.model”:

if not os.path.isfile('reviews.model'):
    spm.SentencePieceTrainer.Train('--input=reviews.txt --model_prefix=reviews --vocab_size=20000 --split_by_whitespace=False')

sp = spm.SentencePieceProcessor()
sp.load('reviews.model')

Once you have trained your SentencePiece model, you can use it to tokenize new text. Here’s an example of how to use the SentencePiece tokenizer model for tokenizing an unseen sentence:

sentence = "The quick brown fox jumps over the lazy dog"

sp.EncodeAsPieces(sentence)
>> ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over▁the', '▁lazy', '▁dog']

sp.EncodeAsIds(sentence)
>> [25, 3411, 11826, 5786, 8022, 2190, 11302, 2048]
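
To see the reversibility mentioned earlier, the same processor can decode pieces or IDs back into the original string (assuming, as here, that no characters fall outside the model's coverage):

pieces = sp.EncodeAsPieces(sentence)
ids = sp.EncodeAsIds(sentence)

sp.DecodePieces(pieces)
>> 'The quick brown fox jumps over the lazy dog'

sp.DecodeIds(ids)
>> 'The quick brown fox jumps over the lazy dog'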

Traditional segmentation

NLTK

One of the most well-known and widely used libraries is the Natural Language Toolkit (NLTK). Initially released in 2001, NLTK was designed for research and is considered a powerful tool for many NLP tasks, such as building parse trees. However, it is not typically considered “production-ready”.

To use NLTK for tokenization, you will first need to download the necessary packages, such as the punkt package:
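
import nltk

nltk.download('punkt')   # one-time download of the Punkt models used by word_tokenize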

Once you have the required packages installed, you can use the word_tokenize function to tokenize the text into individual words:

from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)

SpaCy

Another popular library for tokenization in natural language processing and machine learning tasks is SpaCy. This library is considered industrial-strength and has a user-friendly API, making it easy to perform everyday NLP tasks such as entity recognition.

To use SpaCy for tokenization, you will first need to download the necessary packages by running the following command:

python -m spacy download en_core_web_sm

Once the packages are installed, you can use the following code to tokenize the text into individual words:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]

stanfordnlp

Another popular library for tokenization in natural language processing and machine learning tasks is stanfordnlp.

This library is built on the Java-based CoreNLP library and offers advanced features such as a Bi-Directional LSTM Network with GPU acceleration.

To use stanfordnlp for tokenization, you will need to download the necessary packages by running the following command:

import stanfordnlp

stanfordnlp.download('en')

Once the packages are installed, you can use the following code to tokenize the text into individual words:

import stanfordnlp

nlp = stanfordnlp.Pipeline()
doc = nlp(text)
print([word.text for sentence in doc.sentences for word in sentence.tokens])

Other libraries

Other popular libraries, such as TextBlob, flair, Intel’s NLP Architect, and AllenNLP, don’t use their own segmentation methods. Gensim provides a tokenization function in their utils module as a convenience, but the results are alphanumeric only and lowercase, similar to the regular expression previously mentioned.

Most algorithm-specific libraries use other methods; for example, BERT uses an import of a previously made vocab file.

Basic methods

When it comes to basic segmentation methods in Python, we can use the split() function for basic segmentation tasks.

Split method

First, we’ll pick a random movie review and then use the split() function to separate the text into individual words:

import re, random

reviews = open('reviews.txt').readlines()
text = random.choice(reviews)
words = text.split()
print(words)

The resulting output will be a list of words, separated by the default delimiter, which is whitespace:

[‘This’, ‘film’, ‘is’, ‘undoubtedly’, ‘one’, ‘of’, ‘the’, ‘greatest’, ‘landmarks’, ‘in’, ‘history’, ‘of’, ‘cinema.’, …]

The split() method is a simple approach, but it has its limitations. Punctuation is mixed in with regular words, which makes it less accurate.

Regular expressions

Another method for tokenization is using regular expressions, which allows for more control and specificity.

Here’s an example:

words = re.findall(r'\w+', text)
print(words)
print(words)

The resulting output will be a list of words, with punctuation and other non-alphanumeric characters removed:

[‘This’, ‘film’, ‘is’, ‘undoubtedly’, ‘one’, ‘of’, ‘the’, ‘greatest’, ‘landmarks’, ‘in’, ‘history’, ‘of’, ‘cinema’, …]

This result isn’t too bad, but there are still some problems. Because punctuation and other non-alphanumeric characters are stripped out, information is lost, which is especially problematic for text containing non-ASCII (UTF-8) characters.

This method also does not consider words that contain spaces, such as “New York City,” which should be regarded as one word.

Benchmark comparison

To demonstrate the efficiency of SentencePiece, we can compare its performance to other popular tokenization methods.

The following plot shows the tokenization time of 1,000 movie reviews using an AMD Ryzen 7 CPU and 16GB DDR4 on Xubuntu.



Understanding word segmentation

Word segmentation is a basic NLP task: it decomposes sentences and paragraphs into word units to facilitate subsequent processing and analysis.

This article covers the reasons for word segmentation, the three differences between Chinese and English word segmentation, the three difficulties of Chinese word segmentation, and three typical segmentation methods. Finally, it introduces commonly used tools for Chinese and English word segmentation.


What is word segmentation?

Word segmentation is an important step in natural language processing (NLP).

Word segmentation is the decomposition of long texts such as sentences, paragraphs, and articles into data structures in units of words, which facilitates subsequent processing and analysis.

Why segment words?

1. It turns a complex problem into a mathematical problem

As mentioned in the machine learning article, machine learning appears to solve so many complicated problems because it turns them into mathematical problems.

NLP follows the same idea: text is «unstructured data», which first has to be converted into «structured data»; structured data can then be turned into a mathematical problem, and word segmentation is the first step of that conversion.

2. A word is a more suitable granularity

A word is the smallest unit that expresses a complete meaning.

The granularity of a single character is too small to express a complete meaning. For example, the character «鼠» (rat/mouse) could mean «老鼠» (a mouse, the animal) or «鼠标» (a computer mouse).

The granularity of a sentence is too large: it carries a lot of information, which makes it hard to reuse. For example: «An important reason why traditional methods lag behind on word segmentation is that they are weak at modelling long-distance dependencies.»

3. In the era of deep learning, some tasks can do without word segmentation

In the era of deep learning, as data volume and computing power have grown explosively, many traditional methods have been overturned.

Word segmentation has long been the foundation of NLP, but that is no longer necessarily the case. If you are interested, see the paper «Is Word Segmentation Necessary for Deep Learning of Chinese Representations?».

However, for some specific tasks, word segmentation is still necessary, for example keyword extraction and named entity recognition.

Three typical differences between Chinese and English word segmentation

Difference 1: The way words are separated is different, and Chinese is harder

English has natural spaces as word separators, while Chinese does not, so deciding where to split is itself a difficulty. On top of that, Chinese is highly polysemous, which easily leads to ambiguity. The difficulties described below explain this in detail.

Difference 2: English words have many forms

English words have rich morphological variation. To handle these variations, English NLP needs some processing steps that Chinese does not, namely lemmatization and stemming.

Lemmatization: does, done, doing, did are restored to do.

Stemming: cities, children, teeth are converted to city, child, tooth.

Difference 3: Chinese word segmentation needs to consider granularity

For example, «中国科学技术大学» (University of Science and Technology of China) can be segmented in several ways:

  • 中国科学技术大学 (as a single word)
  • 中国 / 科学技术 / 大学 (China / science and technology / university)
  • 中国 / 科学 / 技术 / 大学 (China / science / technology / university)

The larger the granularity, the more precise the expressed meaning, but the lower the recall. Different scenarios therefore call for different granularities in Chinese; English does not have this problem.

Three major difficulties of Chinese word segmentation

Difficulty 1: There is no unified standard

There is currently no unified standard for Chinese word segmentation and no universally recognized specification; different companies and organizations have their own methods and rules.

Difficulty 2: How to disambiguate

For example, «乒乓球拍卖完了» has two segmentations that express two different meanings:

  • 乒乓球 / 拍卖 / 完了 (the table tennis balls have been auctioned off)
  • 乒乓球拍 / 卖 / 完了 (the table tennis paddles are sold out)

Difficulty 3: Recognizing new words

In the era of information explosion, new words appear every few days, and recognizing them quickly is a major difficulty. For example, when the internet meme «蓝瘦香菇» («blue thin mushroom») went viral, it had to be picked up quickly.

Three typical word segmentation methods

Word segmentation methods fall roughly into three classes:

  1. Dictionary-based matching
  2. Statistics-based
  3. Deep learning

Dictionary-based matching

Advantages: fast and low-cost

Disadvantages: poor adaptability; results vary widely across domains

The basic idea is dictionary matching: the Chinese text to be segmented is split and adjusted according to certain rules and then matched against the words in a dictionary. If a match succeeds, the text is segmented according to the dictionary; if it fails, the split is adjusted and matched again. Representative methods include forward maximum matching, backward (reverse) maximum matching, and bi-directional matching.

Statistics-based word segmentation

Advantages: strong adaptability

Disadvantages: higher cost and slower speed

Commonly used algorithms include HMM, CRF, SVM, and deep learning; for example, the Stanford and HanLP segmenters are based on CRF. Taking CRF as an example, the basic idea is to label each Chinese character: the model considers not only word frequency but also context and has good learning ability, so it performs well on ambiguous words and out-of-vocabulary words.

Deep learning

Advantages: high accuracy and adaptability

Disadvantages: high cost and slow speed

For example, some researchers have implemented a segmenter with a bidirectional LSTM + CRF. This is essentially sequence labeling, so the same approach also generalizes to tasks such as named entity recognition; its reported character-level accuracy reaches 97.5%.

Common segmenters combine machine learning algorithms with dictionaries, which improves segmentation accuracy on the one hand and domain adaptability on the other.

Chinese word segmentation tools, ranked by number of GitHub stars (a short example with Jieba follows the list):

  1. HanLP
  2. Stanford Word Segmenter
  3. Ansj
  4. LTP (Harbin Institute of Technology)
  5. KCWS
  6. Jieba
  7. IK
  8. THULAC (Tsinghua University)
  9. ICTCLAS
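
As a quick illustration, here is a minimal sketch using Jieba, one of the tools listed above; the exact output depends on Jieba's built-in dictionary and whether its HMM-based new-word discovery is enabled:

import jieba

sentence = "乒乓球拍卖完了"   # the ambiguous example from the difficulties section
print("/".join(jieba.cut(sentence)))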

English word segmentation tools

  1. Keras
  2. spaCy
  3. Gensim
  4. NLTK

Final Thoughts

Word segmentation is the decomposition of long texts such as sentences, paragraphs, and articles into data structures in units of words, which facilitates subsequent processing and analysis.

Reasons for word segmentation:

  1. It turns a complex problem into a mathematical problem
  2. A word is a more appropriate granularity
  3. In the deep learning era, some tasks can do without word segmentation

Three typical differences between Chinese and English word segmentation:

  1. The way words are separated is different, and Chinese is harder
  2. English words have many forms, requiring lemmatization and stemming
  3. Chinese word segmentation needs to consider granularity

Three major difficulties of Chinese word segmentation:

  1. No unified standard
  2. How to disambiguate
  3. New word recognition

Three typical word segmentation methods:

  1. Dictionary-based matching
  2. Statistics-based
  3. Deep learning

Baidu Encyclopedia + Wikipedia

Baidu Encyclopedia version

Chinese word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications. In English, words naturally use spaces as delimiters, whereas in Chinese only sentences and paragraphs have explicit delimiters; words have no formal delimiter. Although English also has issues with phrase boundaries, at the word level Chinese is much more complex and much more difficult than English.


Wikipedia version

Word segmentation (tokenization) is the process of dividing, and possibly classifying, a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be viewed as a subtask of parsing the input.


Spaces are a relatively recent invention in Western languages. Although ancient Hebrew and Arabic used spaces to separate words, partly to compensate for the lack of vowels, they were not used in Latin until 600 to 800 A.D. When the Latin alphabet was adopted for English, it was written scripta continua, without any word separators. Later, centered dots were added to make reading easier, and subsequently the dots were replaced with spaces. Today, languages in the CJK (Chinese-Japanese-Korean) family are written without using any spaces or other word delimiters; so are the Thai, Khmer (Cambodian), and Lao languages. This introduces the need for segmentation algorithms to separate words for indexing.

Segmenting words

Finding word boundaries in the absence of spaces is a non-trivial problem, and ambiguities often arise. To help you appreciate the problem, Figure 8.11a shows two interpretations of the same Chinese characters. The text is a play on the ambiguity of phrasing. Once upon a time, the story goes, a man embarked on a long journey. Before he could return home the rainy season began, and he took shelter at a friend’s house. As the rains continued he overstayed his welcome, and his friend wrote him a note: the first line in Figure 8.11a. As shown in the second line, it reads «It is raining, the god would like the guest to stay. Although the god wants you to stay, I do not!» But before taking the hint and leaving, the visitor added the punctuation shown in the third line, making three sentences whose meaning is totally different—»The rainy day, the staying day. Would you like me to stay? Sure!»

This is an example of ambiguity related to phrasing, but ambiguity also can arise with word segmentation. Figure 8.11b shows a more prosaic example. For the ordinary sentence on the first line, there are two different interpretations, depending on the context: «I like New Zealand flowers» and «I like fresh broccoli.»

Written Chinese documents are unsegmented, and readers are accustomed to inferring the corresponding sequence of words almost unconsciously. Accordingly, machine-readable text is usually unsegmented.


Figure 8.11: Alternative interpretations of two Chinese sentences: (a) ambiguity caused by phrasing; (b) ambiguity caused by word boundaries

To render them suitable for full-text retrieval, a segmentation scheme should be used to insert word boundaries at appropriate positions prior to indexing.

One segmentation method is to use a language dictionary. Boundaries are inserted to maximize the number of the words in the text that are also present in the dictionary. Of course, there may be multiple valid segmentations, and heuristics are needed to resolve ambiguities.

Another method is based on the fact that text divided into words is more compressible than text that lacks word boundaries. You can demonstrate this with a simple experiment. Take a text file, compress it with any standard compression utility (such as gzip), and measure the compression ratio. Then remove all the spaces from the file, making it considerably smaller (about 17 percent smaller, because in English approximately one letter in six is a space). When you compress this smaller file, the compression ratio is noticeably worse than for the original file. Inserting word boundaries improves compressibility.
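
A minimal sketch of that experiment in Python; gzip stands in for any standard compression utility, the file name is a placeholder, and the exact ratios depend on the text you use:

import gzip

def compression_ratio(data: bytes) -> float:
    # Compressed size divided by original size: lower means better compression
    return len(gzip.compress(data)) / len(data)

text = open("sample.txt", "rb").read()    # any reasonably large English text file
no_spaces = text.replace(b" ", b"")       # about 17 percent smaller, as noted above

print(compression_ratio(text))            # ratio for the original text
print(compression_ratio(no_spaces))       # typically a noticeably worse (higher) ratio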

This fact can be used to divide text into words, based on a large corpus of hand-segmented training data. Between every two characters lies a potential space. A text compression model can be trained on presegmented text, and coupled with a search algorithm to interpolate spaces to maximize the overall compression. Section 8.5 («Notes and sources») at the end of this topic points to a fuller explanation of the technique.

For non-Chinese readers, the success of the space-insertion technique can be illustrated by applying it to English. Table 8.4 shows original text at the top, complete with spaces. Below is the input to the segmentation procedure. Underneath that is the output of two segmentation schemes: one dictionary-based and the other compression-based. The training text was a substantial sample of English, although far smaller than the corpus used to produce the word dictionary.

Word-based segmentation fails badly when the words are not in the dictionary. In this case both cro-cidolite and Micronite are segmented incorrectly. In addition, inits is treated as a single word because it occurred that way in the text from which the dictionary was created, and in cases of ambiguity the algorithm prefers longer words. The strength of the compression-based method is that it performs well on unknown words. Although Micronite does not occur in the training corpus, it is correctly segmented. The compression-based method makes two errors, however. First, a space was not inserted into LoewsCorp because it happens to require fewer bits to encode than Loews Corp. Second, an extra space was added to crocidolite because that also reduced the number of bits required.

Table 8.4: Segmenting words in English text

Original text:
the unit of New York-based Loews Corp that makes Kent cigarettes stopped using crocidolite in its Micronite cigarette filters in 1956.

Without spaces:
theunitofNewYork-basedLoewsCorpthatmakesKentcigarettesstoppedusingcrocidoliteinitsMicronitecigarettefiltersin1956.

Word-based segmentation:
the unit of New York-based Loews Corp that makes Kent cigarettes stopped using c roc id o lite inits Micron it e cigarette filters in 1956.

Character-based segmentation:
the unit of New York-based LoewsCorp that makes Kent cigarettes stopped using croc idolite in its Micronite cigarette filters in 1956.

Segmenting words in Thai/Khmer/Lao

Thai, Khmer, and Lao are other languages that do not use spaces between the words in a sentence, although they do include spaces between phrases and sentences. Unlike the CJK family, which uses ideographs, they are alphabetic languages. However, they are easier to read than English would be if spaces were omitted, because English provides fewer clues about word breaks.

In Thai, for example, there are many rules that govern where words can begin and end. Thai includes a «silence marker» called gaaran, a little symbol that appears above the last letter of a word to indicate that the letter or letters underneath are not pronounced in the usual Thai pronunciation. Here are some of the rules:

• gaaran ends a word (except for some European loan words, such as the Thai spelling of golf);

• certain characters end a word (except in the rare cases when they are followed by a consonant carrying the gaaran symbol);

• certain other characters always end a word;

• the «preposed» vowels start a word (they are written before their accompanying consonant).

These rules help to determine some word boundaries, but not all, and ambiguities sometimes arise. For example, depending on how it is segmented, one Thai string can mean (roughly) «breath of fresh air» or «round eyes». As with Chinese, there are several possible approaches to determining word boundaries for searching.

Sorting Chinese text

Several different schemes underlie printed Chinese dictionaries and telephone directories. Characters can be ordered according to the number of strokes they contain; or they can be ordered according to their radical, which is a core symbol on which they are built; or they can be ordered according to a standard alphabetical representation called Pinyin, where each ideograph is given a one- to six-letter equivalent. Stroke ordering is probably the most natural way of ordering character strings for Chinese users, although many educated people prefer Pinyin (not all Chinese people know Pinyin). This presents a problem when creating lists of Chinese text that are intended for browsing.

To help you appreciate the issues involved, Figure 8.12 shows title browsers for a large collection of Chinese documents. The rightmost button on the green access bar near the top invokes Figure 8.12a. Here, titles are ordered by the number of strokes in their first character, which is given across the top: the user has selected six. In all the titles that follow, the first character has six strokes. This is probably not obvious from the display, because you can only count the strokes in a character if you know how to write it.


Figure 8.12: Browsing a list of titles in Chinese: (a) stroke-based browsing; (b) Pinyin browsing

To illustrate this, the initial characters for the first and seventh titles are singled out and their writing sequence is superimposed on the screen shot: the first stroke, the first two strokes, the first three strokes, and so on, ending with the complete character, which is circled. All people who read Chinese know immediately how many strokes are needed to write any particular character.

There are generally more than 200 characters corresponding to a given number of strokes, almost any of which could occur as the first character of a title. Hence the titles in each group are displayed in a particular conventional order, again determined by their first character. This ordering is more complex. Each character has an associated radical, the basic structure that underlies it. For example, the radical in the upper example singled out in Figure 8.12a (the first title) is the pattern corresponding to the initial two strokes, which in this case form the left-hand part of the character. Radicals have a conventional ordering that is well known to educated Chinese; this particular one is number 9 in the Unicode sequence. Because this character requires four more strokes than its radical, it is designated by the two-part code 9.4. In the lower example singled out in Figure 8.12a (the seventh title), the radical corresponds to the initial three strokes, which form the top part of the character, and is number 40; thus this character receives the designation 40.3. These codes are shown to the right of Figure 8.12a but would not form part of the final display.

The codes form the key on which titles are sorted. Characters are grouped first by radical number, then by how many strokes are added to the radical to make the character. Ambiguity occasionally arises: the code 86.2, for example, appears twice. In such situations, the tie is broken randomly.

Stroke-based ordering is quite complex, and Chinese readers have to work harder than we do to identify an item in an ordered list. It is easy to decide on the number of strokes, but once a page like Figure 8.12a is reached, most people simply scan it linearly. A strength of computer displays is that they can at least offer a choice of access methods.

The central navigation button of Figure 8.12a invokes the Pinyin browser in Figure 8.12b, which orders characters alphabetically by their Pinyin equivalent. The Pinyin codes for the titles are shown to the right of the figure, but again would not form part of the final display. Obviously, this arrangement is much easier for Westerners to comprehend.

Word segmentation is the problem of splitting a string of written language into its component words. … Dictionary-based and machine learning approaches were used to split compound words. This research also aims at evaluating the quality of a word segmentation by comparing it with a reference segmentation.
