- Introduction to Chinese Word Segmentation
- What is word segmentation
- What are the word segmentation algorithms
- What is a good word segmentation algorithm
- Based on matching rules
- Forward-max matching
- Backward-max matching
- Bi-direction Matching
- Based on probability statistics
- Language model
- HMM/CRF
A story to begin with: the poem line 日照香炉生紫烟 can be segmented in two different ways:
日 / 照香炉 / 生 / 紫烟 (the sun / shines on the incense burner / giving rise to / purple smoke)
日照 / 香炉 / 生 / 紫烟 (sunlight / incense burner / gives rise to / purple smoke)
Which one is right? Let's learn the word segmentation algorithms together.
Introduction to Chinese Word Segmentation
What is word segmentation
Borrowing the definition from Baidu Encyclopedia: word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain specifications.
What are the word segmentation algorithms
Segmentation methods can be roughly divided into two categories: rule-based word segmentation and statistics-based word segmentation.
- Rule-based word segmentation
- Forward maximum match
- Backward max match
- Minimal segmentation (to minimize the number of words cut out in each sentence)
- Two-way maximum match
- Based on statistics
- Language model
- HMM
- CRF
- Deep learning
What is a good word segmentation algorithm
This involves the design principles of word segmentation algorithms:
- The larger the granularity, the better
- The fewer out-of-vocabulary (non-dictionary) words in the result, the better, and the fewer single-character words, the better
- The fewer words overall, the better
Based on matching rules
Forward-max matching
Example: [日照香炉生紫烟]
Dictionary: ["日","日照","香炉","照香炉","生","紫烟"]
Suppose we set the maximum word length to 5. Let's see how forward maximum matching proceeds.
First round:
«日照香炉生», no match in the dictionary
«日照香炉», no match
«日照香», no match
«日照», match
Second round:
«香炉生紫烟», no match
«香炉生紫», no match
«香炉生», no match
«香炉», match
Third round:
«生紫烟», no match
«生紫», no match
«生», match
Fourth round:
«紫烟», match
The final segmentation result is: 日照 / 香炉 / 生 / 紫烟
Code:
# Forward maximum matching
def forward_max_matching(text, maxlen, vocab):
    results = []
    while text:
        # Take the longest candidate substring from the front of the text
        subtext = text if len(text) < maxlen else text[0:maxlen]
        while subtext:
            if subtext in vocab:
                results.append(subtext)
                text = text[len(subtext):]
                break
            # Shorten the candidate by one character and try again
            subtext = subtext[0:len(subtext) - 1]
        else:
            # No dictionary word matched: emit a single character and move on
            results.append(text[0])
            text = text[1:]
    return results
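A quick sanity check on the example above (a hypothetical usage snippet, using the dictionary listed earlier):
vocab = ["日", "日照", "香炉", "照香炉", "生", "紫烟"]
print(forward_max_matching("日照香炉生紫烟", 5, vocab))
# ['日照', '香炉', '生', '紫烟']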
Backward-max matching
Example: [日照香炉生紫烟]
Dictionary: ["日","日照","香炉","照香炉","生","紫烟"]
Suppose we again set the maximum word length to 5. Let's see how backward maximum matching proceeds.
First round:
«香炉生紫烟», no match
«炉生紫烟», no match
«生紫烟», no match
«紫烟», match
Second round:
«日照香炉生», no match
«照香炉生», no match
«香炉生», no match
«炉生», no match
«生», match
Third round:
«日照香炉», no match
«照香炉», match
Fourth round:
«日», match
The final segmentation result is: 日 / 照香炉 / 生 / 紫烟
Note that the two segmentation results are different!
Code:
# Backward maximum matching
def backward_max_matching(text, maxlen, vocab):
    results = []
    while text:
        # Take the longest candidate substring from the end of the text
        subtext = text if len(text) < maxlen else text[-maxlen:]
        while subtext:
            if subtext in vocab:
                results.append(subtext)
                text = text[:-len(subtext)]
                break
            # Drop the leftmost character of the candidate and try again
            subtext = subtext[1:]
        else:
            # No dictionary word matched: emit the last character and move on
            results.append(text[-1])
            text = text[:-1]
    # Words were collected from right to left, so reverse them
    return results[::-1]
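Running the backward version on the same input (again a hypothetical check with the same dictionary) reproduces the second segmentation:
vocab = ["日", "日照", "香炉", "照香炉", "生", "紫烟"]
print(backward_max_matching("日照香炉生紫烟", 5, vocab))
# ['日', '照香炉', '生', '紫烟']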
Bi-direction Matching
Compare the results of forward maximum matching and backward maximum matching to decide which segmentation to keep.
Algorithm flow:
- Compare the results of forward maximum matching and backward maximum matching
- If the numbers of words differ, return the segmentation with fewer words
- If the numbers of words are the same
- If the segmentations are identical, return either one
- If the segmentations differ, return the one with fewer single-character words
# Bidirectional maximum matching
def bidirection_matching(text, maxlen, vocab):
    forward = forward_max_matching(text, maxlen, vocab)
    backward = backward_max_matching(text, maxlen, vocab)
    # Different numbers of words: return the segmentation with fewer words
    if len(forward) != len(backward):
        return forward if len(forward) < len(backward) else backward
    # Same number of words and identical segmentations: return either one
    if forward == backward:
        return forward
    # Same number of words but different segmentations:
    # return the one with fewer single-character words
    for_single = [word for word in forward if len(word) == 1]
    back_single = [word for word in backward if len(word) == 1]
    return forward if len(for_single) < len(back_single) else backward
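Checking the combined strategy on the same example (hypothetical usage): both candidates contain four words, so the tie is broken by counting single-character words, and the forward result wins.
vocab = ["日", "日照", "香炉", "照香炉", "生", "紫烟"]
print(bidirection_matching("日照香炉生紫烟", 5, vocab))
# ['日照', '香炉', '生', '紫烟']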
Intuitively, how do we get a good segmentation?
Input text --> enumerate all possible segmentations --> choose the best one
As we have seen, rule-based matching only yields a locally optimal solution, and, more importantly, it ignores the semantics of the sentence. Choosing the best result among all possible segmentations requires a language model.
Based on probability statistics
Language model
A language model is a model that computes the probability of a sentence, i.e., it judges how likely a sentence is to be natural language.
Given a sentence
$$S = W_1, W_2, \dots, W_K$$
its probability can be written as
$$p(S) = p(W_1, W_2, \dots, W_K)$$
Under the Markov assumption, each word depends only on a limited number of preceding words. If we go further and assume that every word is independent of the others, we obtain the unigram language model, and the probability factorizes as
$$p(S) = p(W_1, W_2, \dots, W_K) = p(W_1)\,p(W_2)\cdots p(W_K)$$
Relative to the whole corpus, any single word has a very small probability, so multiplying many small numbers can underflow. Taking logarithms turns the product into a sum, and the segmentation with the highest log-probability is the best one.
There is still a problem: enumerating every possible segmentation of the input and scoring each one is far too inefficient. We need a way to interleave generating candidate segmentations with computing their probabilities, and a word graph (lattice) does exactly that.
In such a graph, each edge represents a character or word, and p is the probability of the corresponding word in the dictionary. With a unigram language model, all we need to do is find the path with the largest probability product, which, after taking negative logarithms, becomes a shortest-path problem that can be solved with dynamic programming.
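As a minimal sketch of the lattice-plus-dynamic-programming idea (the dictionary probabilities below are made-up illustrative values, and the function name unigram_segment is not from the original article):
import math

# Toy dictionary: word -> probability (made-up values for illustration)
word_probs = {"日": 0.01, "日照": 0.02, "香炉": 0.02,
              "照香炉": 0.005, "生": 0.03, "紫烟": 0.008}

def unigram_segment(text, word_probs, max_len=5):
    # best[i] = (cost, j): cheapest segmentation of text[:i], where cost is
    # the sum of negative log probabilities and j is where the last word starts
    best = [(0.0, 0)] + [(math.inf, 0) for _ in text]
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_probs:
                cost = best[j][0] - math.log(word_probs[word])
                if cost < best[i][0]:
                    best[i] = (cost, j)
    # Walk the split points backwards to recover the words
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]

print(unigram_segment("日照香炉生紫烟", word_probs))
# ['日照', '香炉', '生', '紫烟'] with the made-up probabilities above
With these probabilities the maximum-probability path prefers 日照 / 香炉 over 日 / 照香炉, which is exactly the kind of decision the matching rules alone could not make.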
HMM/CRF
Sequence labeling can also be used to solve word segmentation: each character is labeled with one of four states:
B (beginning of a word), M (middle), E (end), S (single-character word)
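As a small, hypothetical helper (not from the article): once a tagger has assigned a B/M/E/S label to every character, the words can be recovered with a simple scan.
def bmes_to_words(chars, tags):
    # chars and tags have the same length; tags are 'B', 'M', 'E' or 'S'
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":        # single-character word
            words.append(ch)
            buf = ""
        elif tag == "B":      # start of a new word
            buf = ch
        elif tag == "M":      # middle of the current word
            buf += ch
        else:                 # "E": end of the current word
            words.append(buf + ch)
            buf = ""
    return words

print(bmes_to_words("日照香炉生紫烟", ["B", "E", "B", "E", "S", "B", "E"]))
# ['日照', '香炉', '生', '紫烟']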
In practice, the choice of method depends on the scenario. For example, when a search engine parses large volumes of web pages, segmentation speed matters more than accuracy; in other scenarios, such as analyzing user queries, accuracy matters more than speed.
References
Matching method. https://blog.csdn.net/selinda001/article/details/79345072
Overview of Chinese word segmentation. https://zhuanlan.zhihu.com/p/67185497
Overview of word segmentation algorithms. https://zhuanlan.zhihu.com/p/50444885
What is tokenization?
Tokenization involves breaking text into individual words, making it easier for computers to understand and analyze meaning. This task applies to various Natural Language Processing (NLP) applications such as language translation, text summarization, and sentiment analysis.
In this post, we will explore using SentencePiece, a widely used open-source library for tokenization in Python.
Setup
Before diving into the implementation, we have to download the IMDB Movie Reviews Dataset, which can be found here.
Once the dataset is downloaded, consolidate the reviews into a single file named reviews.txt, with each review on a separate line.
Unsupervised segmentation
SentencePiece
SentencePiece is an open-source library that allows for unsupervised tokenization.
Unlike traditional tokenization methods, SentencePiece is reversible. It can reconstruct the original text given a dictionary of known words and a sequence of token IDs. This means there is no information loss.
SentencePiece is also grammar-agnostic, meaning it can learn a segmentation model from raw bytes. This makes it suitable for many different text types, including logographic languages like Chinese. SentencePiece eliminates the need for large language models, and updating new terms is a breeze.
To use SentencePiece for tokenization in Python, you must first import the necessary modules. If you do not have sentencepiece installed, use pip install sentencepiece.
import os
import sentencepiece as spm
Once you have the necessary modules imported, you can use SentencePiece to train a model on your text data. The following code will train a model on the “reviews.txt” file and save it as “reviews.model”:
if not os.path.isfile('reviews.model'):
    spm.SentencePieceTrainer.Train('--input=reviews.txt --model_prefix=reviews --vocab_size=20000 --split_by_whitespace=False')

sp = spm.SentencePieceProcessor()
sp.load('reviews.model')
Once you have trained your SentencePiece model, you can use it to tokenize new text. Here’s an example of how to use the SentencePiece tokenizer model for tokenizing an unseen sentence:
sentence = "The quick brown fox jumps over the lazy dog" sp.EncodeAsPieces(sentence) >> ['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over▁the', '▁lazy', '▁dog'] sp.EncodeAsIds(sentence) >> [25, 3411, 11826, 5786, 8022, 2190, 11302, 2048]
Traditional segmentation
NLTK
One of the most well-known and widely used libraries is the Natural Language Toolkit (NLTK). Initially released in 2001, NLTK was designed for research and is considered a powerful tool for many NLP tasks, such as parse trees. However, it is not typically considered “production-ready”.
To use NLTK for tokenization, you will first need to download the necessary resources, such as the punkt package.
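For example, the punkt models can be fetched with NLTK's built-in downloader:
import nltk
nltk.download('punkt')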
Once you have the required packages installed, you can use the word_tokenize function to tokenize the text into individual words:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
SpaCy
Another popular library for tokenization in natural language processing and machine learning tasks is SpaCy. This library is considered industrial-strength and has a user-friendly API, making it easy to perform everyday NLP tasks such as entity recognition.
To use SpaCy for tokenization, you will first need to download the necessary packages by running the following command:
python -m spacy download en_core_web_sm
Once the packages are installed, you can use the following code to tokenize the text into individual words:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [token.text for token in doc]
stanfordnlp
Another popular library for tokenization in natural language processing and machine learning tasks is stanfordnlp.
This library is built on the Java-based CoreNLP library and offers advanced features such as a Bi-Directional LSTM Network with GPU acceleration.
To use stanfordnlp for tokenization, you will need to download the necessary packages by running the following command:
stanfordnlp.download('en')
Once the packages are installed, you can use the following code to tokenize the text into individual words:
import stanfordnlp

nlp = stanfordnlp.Pipeline()
doc = nlp(text)
print([word.text for sentence in doc.sentences for word in sentence.tokens])
Other libraries
Other popular libraries, such as TextBlob, flair, Intel’s NLP Architect, and AllenNLP, don’t use their own segmentation methods. Gensim provides a tokenization function in their utils module as a convenience, but the results are alphanumeric only and lowercase, similar to the regular expression previously mentioned.
Most algorithm-specific libraries use other methods; for example, BERT uses an import of a previously made vocab file.
Basic methods
For basic segmentation tasks in Python, we can use the built-in split() function.
Split method
First, we'll pick a random movie review and then use the split() function to separate the text into individual words:
import re, random

reviews = open('reviews.txt').readlines()
text = random.choice(reviews)
words = text.split()
print(words)
The resulting output will be a list of words, separated by the default delimiter, which is whitespace:
[‘This’, ‘film’, ‘is’, ‘undoubtedly’, ‘one’, ‘of’, ‘the’, ‘greatest’, ‘landmarks’, ‘in’, ‘history’, ‘of’, ‘cinema.’, …]
The split() method is a simple approach, but it has its limitations. Punctuation is mixed in with regular words, which makes it less accurate.
Regular expressions
Another method for tokenization is using regular expressions, which allows for more control and specificity.
Here’s an example:
words = re.findall(r'\w+', text)
print(words)
The resulting output will be a list of words, with punctuation and other non-alphanumeric characters removed:
[‘This’, ‘film’, ‘is’, ‘undoubtedly’, ‘one’, ‘of’, ‘the’, ‘greatest’, ‘landmarks’, ‘in’, ‘history’, ‘of’, ‘cinema’, …]
This result isn't too bad, but there are still some problems. Dropping punctuation and other non-alphanumeric characters loses information, which can be especially problematic for UTF-8 text.
This method also does not consider words that contain spaces, such as “New York City,” which should be regarded as one word.
Benchmark comparison
To demonstrate the efficiency of SentencePiece, we can compare its performance to other popular tokenization methods.
The following plot shows the tokenization time of 1,000 movie reviews using an AMD Ryzen 7 CPU and 16GB DDR4 on Xubuntu.
Word segmentation is a basic NLP task: it breaks sentences and paragraphs down into word units so that they can be analyzed in subsequent processing.
This article introduces the reasons for word segmentation, the 3 differences between Chinese and English word segmentation, the 3 difficulties of Chinese word segmentation, and the 3 typical methods of word segmentation. Finally, the tools commonly used for Chinese and English word segmentation are introduced.
What is word segmentation?
Word segmentation is an important step in natural language processing (NLP).
It is the decomposition of long texts such as sentences, paragraphs, and articles into data structures whose units are words, which facilitates subsequent processing and analysis.
Why segment words?
1. Turning complex problems into mathematical problems
As mentioned in the article on machine learning, machine learning appears to solve many complicated problems because it turns them into mathematical problems.
NLP follows the same idea. Text is «unstructured data» that must first be converted into «structured data»; structured data can then be turned into mathematics, and word segmentation is the first step of that conversion.
2. A word is a more suitable granularity
A word is the smallest unit that expresses a complete meaning.
The granularity of a single character is too small to express a complete meaning. For example, the character 鼠 ("rat/mouse") could belong to 老鼠 ("mouse", the animal) or 鼠标 ("mouse", the pointing device).
The granularity of a sentence is too large: it carries too much information and is hard to reuse. For example: «An important reason traditional methods need word segmentation is that they are weak at modeling long-distance dependencies.»
3. In the era of deep learning, some tasks can skip word segmentation
In the era of deep learning, with the explosive growth of data volume and computing power, many traditional methods have been overturned.
Word segmentation has always been a foundation of NLP, but that is no longer necessarily the case. If you are interested, you can check out the paper «Is Word Segmentation Necessary for Deep Learning of Chinese Representations?».
However, for some specific tasks, word segmentation is still necessary, such as keyword extraction and named entity recognition.
3 typical differences between Chinese and English word segmentation
Difference 1: the way of delimiting words differs, and Chinese is harder
English has natural spaces as separators, while Chinese does not, so deciding where to split is hard. In addition, Chinese words are highly polysemous, which easily produces ambiguity. The difficulties are explained in detail in the next section.
Difference 2: English words have many inflected forms
English words have rich morphological variation. To deal with these variations, English NLP has some processing steps that Chinese does not need, namely lemmatization and stemming.
Lemmatization: does, done, doing, did are all restored to do.
Stemming: cities, children, teeth are converted to city, child, tooth.
Difference 3: Chinese word segmentation needs to consider granularity
For example, «中国科学技术大学» (University of Science and Technology of China) can be split in several ways:
- 中国科学技术大学
- 中国 \ 科学技术 \ 大学 (China \ science and technology \ university)
- 中国 \ 科学 \ 技术 \ 大学 (China \ science \ technology \ university)
The larger the granularity, the more precise the expressed meaning, but the lower the recall. So Chinese needs different granularities in different scenarios, a problem English does not have.
3 big difficulties of Chinese word segmentation
Difficulty 1: no unified standard
There is currently no unified standard for Chinese word segmentation and no universally recognized norm; different companies and organizations have their own methods and rules.
Difficulty 2: how to resolve ambiguity
For example, «乒乓球拍卖完了» has two segmentations expressing two different meanings:
- 乒乓球 \ 拍卖 \ 完了 (the table-tennis balls have been auctioned off)
- 乒乓 \ 球拍 \ 卖 \ 完了 (the table-tennis paddles are sold out)
Difficulty 3: recognition of new words
In the era of information explosion, new words appear every few days, and recognizing them quickly is a major difficulty. For example, when the meme «蓝瘦香菇» ("blue thin mushroom") went viral, it needed to be recognized quickly.
3 typical word segmentation methods
Word segmentation methods fall roughly into 3 classes:
- Dictionary-based matching
- Statistics-based methods
- Deep learning
Dictionary-based matching
Advantages: fast and cheap
Disadvantages: poor adaptability; results vary widely across domains
The basic idea is dictionary matching: the Chinese text to be segmented is split and adjusted according to certain rules and then matched against the words in a dictionary. If a match succeeds, the text is segmented according to the dictionary word; if it fails, the split is adjusted or re-chosen and the process repeats. Representative methods include forward maximum matching, backward (reverse) maximum matching, and bidirectional matching.
Statistics-based segmentation
Advantages: strong adaptability
Disadvantages: higher cost and slower speed
Commonly used algorithms include HMM, CRF, SVM, and deep learning; segmentation tools such as Stanford and HanLP are based on the CRF algorithm. Taking CRF as an example, the basic idea is to label each Chinese character: the model considers not only the frequency of words but also their context, and it has good learning ability, so it performs well on ambiguous words and out-of-vocabulary words.
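As a minimal sketch of the labeling idea (a generic HMM-style Viterbi decoder over the B/M/E/S states, not the actual CRF used by these tools; start_p, trans_p, and emit_p are log-probability tables that a real system would estimate from a hand-segmented corpus):
STATES = ["B", "M", "E", "S"]

def viterbi(chars, start_p, trans_p, emit_p):
    # start_p[s]: log-probability of starting in state s
    # trans_p[s1][s2]: log-probability of moving from state s1 to s2
    # emit_p[s][ch]: log-probability of state s emitting character ch
    V = [{s: start_p[s] + emit_p[s].get(chars[0], -1e9) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in chars[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            score, prev = max(
                (V[-2][p] + trans_p[p].get(s, -1e9) + emit_p[s].get(ch, -1e9), p)
                for p in STATES
            )
            V[-1][s] = score
            new_path[s] = path[prev] + [s]
        path = new_path
    best_state = max(V[-1], key=V[-1].get)
    return path[best_state]
The returned B/M/E/S sequence is then converted into words with a scan like the helper shown earlier; a CRF replaces the fixed transition and emission probabilities with feature-based potentials learned from labeled data.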
Deep learning
Advantages: high accuracy and good adaptability
Disadvantages: high cost and slow speed
For example, some researchers have implemented a segmenter with a bidirectional LSTM + CRF. It is essentially a sequence labeling model, so it is versatile and can also be used for named entity recognition and similar tasks; its reported character-level accuracy is as high as 97.5%.
Common segmenters combine machine learning algorithms with dictionaries, which improves segmentation accuracy on the one hand and domain adaptability on the other.
Chinese word segmentation tools, ranked by the number of stars on GitHub:
- HanLP
- Stanford Word Segmenter
- Ansj segmenter
- Harbin Institute of Technology LTP
- KCWS segmenter
- Jieba
- IK
- Tsinghua University THULAC
- ICTCLAS
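As a quick illustration of one of these tools (Jieba's default cut; the exact split depends on Jieba's built-in dictionary and version):
import jieba

print(jieba.lcut("日照香炉生紫烟"))
# prints a list of words; the exact segmentation depends on the dictionary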
English word segmentation tools
- Keras
- spaCy
- Gensim
- NLTK
Final Thoughts
Word segmentation is the decomposition of long texts such as sentences, paragraphs, and articles into data structures whose units are words, which facilitates subsequent processing and analysis.
Reasons for word segmentation:
- Turn complex problems into mathematical problems
- A word is a more appropriate granularity
- In the era of deep learning, some tasks can skip word segmentation
3 typical differences between Chinese and English word segmentation:
- The way of delimiting words differs, and Chinese is harder
- English words have many inflected forms, requiring lemmatization and stemming
- Chinese word segmentation needs to consider granularity
3 big difficulties of Chinese word segmentation:
- No unified standard
- How to resolve ambiguity
- Recognition of new words
3 typical word segmentation methods:
- Dictionary-based matching
- Statistics-based methods
- Deep learning
Baidu Encyclopedia + Wikipedia
Baidu Encyclopedia version
Chinese word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain specifications. In English, spaces act as natural delimiters between words; in Chinese, only sentences and paragraphs can be delimited by explicit marks, while words have no formal delimiter. Although English also has problems dividing phrases, at the word level Chinese is far more complex and difficult than English.
Wikipedia version
Word segmentation (tokenization) is the process of dividing, and possibly classifying, a sequence of input characters. The resulting tokens are then passed on to some other form of processing. The process can be regarded as a subtask of parsing the input.
Spaces are a relatively recent invention in Western languages. Although ancient Hebrew and Arabic used spaces to separate words, partly to compensate for the lack of vowels, they were not used in Latin until 600 to 800 A.D. When the Latin alphabet was adopted for English, it was written scripta continua, without any word separators. Later, centered dots were added to make reading easier, and subsequently the dots were replaced with spaces. Today, languages in the CJK (Chinese-Japanese-Korean) family are written without using any spaces or other word delimiters; so are the Thai, Khmer (Cambodian), and Lao languages. This introduces the need for segmentation algorithms to separate words for indexing.
Segmenting words
Finding word boundaries in the absence of spaces is a non-trivial problem, and ambiguities often arise. To help you appreciate the problem, Figure 8.11a shows two interpretations of the same Chinese characters. The text is a play on the ambiguity of phrasing. Once upon a time, the story goes, a man embarked on a long journey. Before he could return home the rainy season began, and he took shelter at a friend’s house. As the rains continued he overstayed his welcome, and his friend wrote him a note: the first line in Figure 8.11a. As shown in the second line, it reads «It is raining, the god would like the guest to stay. Although the god wants you to stay, I do not!» But before taking the hint and leaving, the visitor added the punctuation shown in the third line, making three sentences whose meaning is totally different—»The rainy day, the staying day. Would you like me to stay? Sure!»
This is an example of ambiguity related to phrasing, but ambiguity also can arise with word segmentation. Figure 8.11b shows a more prosaic example. For the ordinary sentence on the first line, there are two different interpretations, depending on the context: «I like New Zealand flowers» and «I like fresh broccoli.»
Written Chinese documents are unsegmented, and readers are accustomed to inferring the corresponding sequence of words almost unconsciously. Accordingly, machine-readable text is usually unsegmented.
Figure 8.11: Alternative interpretations of two Chinese sentences: (a) ambiguity caused by phrasing; (b) ambiguity caused by word boundaries
To render them suitable for full-text retrieval, a segmentation scheme should be used to insert word boundaries at appropriate positions prior to indexing.
One segmentation method is to use a language dictionary. Boundaries are inserted to maximize the number of the words in the text that are also present in the dictionary. Of course, there may be multiple valid segmentations, and heuristics are needed to resolve ambiguities.
Another method is based on the fact that text divided into words is more compressible than text that lacks word boundaries. You can demonstrate this with a simple experiment. Take a text file, compress it with any standard compression utility (such as gzip), and measure the compression ratio. Then remove all the spaces from the file, making it considerably smaller (about 17 percent smaller, because in English approximately one letter in six is a space). When you compress this smaller file, the compression ratio is noticeably worse than for the original file. Inserting word boundaries improves compressibility.
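A rough sketch of this experiment (assuming a local plain-text file named sample.txt; the exact ratios depend on the text used):
import gzip

with open("sample.txt", "rb") as f:
    original = f.read()

no_spaces = original.replace(b" ", b"")

for label, data in [("with spaces", original), ("without spaces", no_spaces)]:
    compressed = gzip.compress(data)
    # Report compressed size as a fraction of the uncompressed size
    print(f"{label}: {len(compressed) / len(data):.3f}")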
This fact can be used to divide text into words, based on a large corpus of hand-segmented training data. Between every two characters lies a potential space. A text compression model can be trained on presegmented text, and coupled with a search algorithm to interpolate spaces to maximize the overall compression. Section 8.5 («Notes and sources») at the end of this topic points to a fuller explanation of the technique.
For non-Chinese readers, the success of the space-insertion technique can be illustrated by applying it to English. Table 8.4 shows original text at the top, complete with spaces. Below is the input to the segmentation procedure. Underneath that is the output of two segmentation schemes: one dictionary-based and the other compression-based. The training text was a substantial sample of English, although far smaller than the corpus used to produce the word dictionary.
Word-based segmentation fails badly when the words are not in the dictionary. In this case both crocidolite and Micronite are segmented incorrectly. In addition, inits is treated as a single word because it occurred that way in the text from which the dictionary was created, and in cases of ambiguity the algorithm prefers longer words. The strength of the compression-based method is that it performs well on unknown words. Although Micronite does not occur in the training corpus, it is correctly segmented. The compression-based method makes two errors, however. First, a space was not inserted into LoewsCorp because it happens to require fewer bits to encode than Loews Corp. Second, an extra space was added to crocidolite because that also reduced the number of bits required.
Table 8.4: Segmenting words in English text
Original text: the unit of New York-based Loews Corp that makes Kent cigarettes stopped using crocidolite in its Micronite cigarette filters in 1956.
Without spaces: theunitofNewYork-basedLoewsCorpthatmakesKentcigarettesstoppedusingcrocidoliteinitsMicronitecigarettefiltersin1956.
Word-based segmentation: the unit of New York-based Loews Corp that makes Kent cigarettes stopped using c roc id o lite inits Micron it e cigarette filters in 1956.
Character-based segmentation: the unit of New York-based LoewsCorp that makes Kent cigarettes stopped using croc idolite in its Micronite cigarette filters in 1956.
Segmenting words in Thai/Khmer/Lao
Thai, Khmer, and Lao are other languages that do not use spaces between the words in a sentence, although they do include spaces between phrases and sentences. Unlike the CJK family, which uses ideographs, they are alphabetic languages. However, they are easier to read than English would be if spaces were omitted, because English provides fewer clues about word breaks.
In Thai, for example, there are many rules that govern where words can begin and end. Thai includes a «silence marker» called gaaran—the little symbol that appears above the last letter in the word which indicates that the letter or letters underneath are not pronounced in the usual Thai pronunciation. Here are some of the rules:
• gaaran ends a word (except for some European loan words, such as the Thai word for golf);
• certain letters end a word (except in the rare cases when they are followed by a consonant carrying the gaaran symbol);
• certain other letters always end a word;
• the «preposed» vowels start a word (they are written before their accompanying consonant).
These rules help to determine some word boundaries, but not all, and ambiguities sometimes arise: depending on how it is segmented, one Thai string means (roughly) «breath of fresh air» or «round eyes». As with Chinese, there are several possible approaches to determining word boundaries for searching.
Sorting Chinese text
Several different schemes underlie printed Chinese dictionaries and telephone directories. Characters can be ordered according to the number of strokes they contain; or they can be ordered according to their radical, which is a core symbol on which they are built; or they can be ordered according to a standard alphabetical representation called Pinyin, where each ideograph is given a one- to six-letter equivalent. Stroke ordering is probably the most natural way of ordering character strings for Chinese users, although many educated people prefer Pinyin (not all Chinese people know Pinyin). This presents a problem when creating lists of Chinese text that are intended for browsing.
To help you appreciate the issues involved, Figure 8.12 shows title browsers for a large collection of Chinese documents. The rightmost button on the green access bar near the top invokes Figure 8.12a. Here, titles are ordered by the number of strokes in their first character, which is given across the top: the user has selected six. In all the titles that follow, the first character has six strokes. This is probably not obvious from the display, because you can only count the strokes in a character if you know how to write it.
Figure 8.12: Browsing a list of titles in Chinese: (a) stroke-based browsing; (b) Pinyin browsing
To illustrate this, the initial characters for the first and seventh titles are singled out and their writing sequence is superimposed on the screen shot: the first stroke, the first two strokes, the first three strokes, and so on, ending with the complete character, which is circled. All people who read Chinese know immediately how many strokes are needed to write any particular character.
There are generally more than 200 characters corresponding to a given number of strokes, almost any of which could occur as the first character of a title. Hence the titles in each group are displayed in a particular conventional order, again determined by their first character. This ordering is more complex. Each character has an associated radical, the basic structure that underlies it. For example, the radical in the upper example singled out in Figure 8.12a (the first title) is the pattern corresponding to the initial two strokes, which in this case form the left-hand part of the character. Radicals have a conventional ordering that is well known to educated Chinese; this particular one is number 9 in the Unicode sequence. Because this character requires four more strokes than its radical, it is designated by the two-part code 9.4. In the lower example singled out in Figure 8.12a (the seventh title), the radical corresponds to the initial three strokes, which form the top part of the character, and is number 40; thus this character receives the designation 40.3. These codes are shown to the right of Figure 8.12a but would not form part of the final display.
The codes form the key on which titles are sorted. Characters are grouped first by radical number, then by how many strokes are added to the radical to make the character. Ambiguity occasionally arises: the code 86.2, for example, appears twice. In such situations, the tie is broken randomly.
Stroke-based ordering is quite complex, and Chinese readers have to work harder than we do to identify an item in an ordered list. It is easy to decide on the number of strokes, but once a page like Figure 8.12a is reached, most people simply scan it linearly. A strength of computer displays is that they can at least offer a choice of access methods.
The central navigation button of Figure 8.12a invokes the Pinyin browser in Figure 8.12b, which orders characters alphabetically by their Pinyin equivalent. The Pinyin codes for the titles are shown to the right of the figure, but again would not form part of the final display. Obviously, this arrangement is much easier for Westerners to comprehend.
Word segmentation is the problem of splitting a string of written language into its component words. Dictionary-based and machine-learning approaches can be used to split compound words, and the quality of a segmentation is typically evaluated by comparing it with a reference segmentation.