The problem of word boundaries

[Image not shown: an advertisement from the County Down Spectator illustrating the importance of recognizing word boundaries.]

In writing, word boundaries are conventionally represented by spaces between words. In speech, word boundaries are determined in various ways, as discussed below.

Related Grammatical and Rhetorical Terms

  • Assimilation and Dissimilation
  • Conceptual Meaning
  • Connected Speech
  • Intonation
  • Metanalysis
  • Mondegreen
  • Morpheme and Phoneme
  • Oronyms
  • Pause
  • Phonetics and Phonology
  • Phonological Word
  • Prosody
  • Segment and Suprasegmental
  • Slip of the Ear
  • Sound Change

Examples of Word Boundaries

  • «When I was very young, my mother scolded me for flatulating by saying, ‘Johnny, who made an odor?’ I misheard her euphemism as ‘who made a motor?’ For days I ran around the house amusing myself with those delicious words.» (John B. Lee, Building Bicycles in the Dark: A Practical Guide on How to Write. Black Moss Press, 2001)
  • «I could have sworn I heard on the news that the Chinese were producing new trombones. No, it was neutron bombs.» (Doug Stone, quoted by Rosemarie Jarski in Dim Wit: The Funniest, Stupidest Things Ever Said. Ebury, 2008)
  • «As far as input processing is concerned, we may also recognize slips of the ear, as when we start to hear a particular sequence and then realize that we have misperceived it in some way; e.g. perceiving the ambulance at the start of the yam balanced delicately on the top. . . .» (Michael Garman, Psycholinguistics. Cambridge University Press, 2000)

Word Recognition

  • «The usual criterion for word recognition is that suggested by the linguist Leonard Bloomfield, who defined a word as ‘a minimal free form.’ . . .
  • «The concept of a word as ‘a minimal free form’ suggests two important things about words. First, their ability to stand on their own as isolates. This is reflected in the space which surrounds a word in its orthographical form. And secondly, their internal integrity, or cohesion, as units. If we move a word around in a sentence, whether spoken or written, we have to move the whole word or none of it—we cannot move part of a word.»
    (Geoffrey Finch, Linguistic Terms and Concepts. Palgrave Macmillan, 2000)
  • «[T]he great majority of English nouns begins with a stressed syllable. Listeners use this expectation about the structure of English and partition the continuous speech stream employing stressed syllables.»
    (Z.S. Bond, «Slips of the Ear.» The Handbook of Speech Perception, ed. by David Pisoni and Robert Remez. Wiley-Blackwell, 2005)

Tests of Word Identification

  • Potential pause: Say a sentence out loud, and ask someone to ‘repeat it very slowly, with pauses.’ The pauses will tend to fall between words, and not within words. For example, the / three / little / pigs / went / to / market. . . .
  • Indivisibility: Say a sentence out loud, and ask someone to ‘add extra words’ to it. The extra item will be added between the words and not within them. For example, the pig went to market might become the big pig once went straight to the market. . . .
  • Phonetic boundaries: It is sometimes possible to tell from the sound of a word where it begins or ends. In Welsh, for example, long words generally have their stress on the penultimate syllable. . . . But there are many exceptions to such rules.
  • Semantic units: In the sentence Dog bites vicar, there are plainly three units of meaning, and each unit corresponds to a word. But language is often not as neat as this. In I switched on the light, the has little clear ‘meaning,’ and the single action of ‘switching on’ involves two words.
    (Adapted from The Cambridge Encyclopedia of Language, 3rd ed., by David Crystal. Cambridge University Press, 2010)

Explicit Segmentation

  • «[E]xperiments in English have suggested that listeners segment speech at strong syllable onsets. For example, finding a real word in a spoken nonsense sequence is hard if the word is spread over two strong syllables (e.g., mint in [mɪntef]) but easier if the word is spread over a strong and a following weak syllable (e.g., mint in [mɪntəf]; Cutler & Norris, 1988).
    The proposed explanation for this is that listeners divide the former sequence at the onset of the second strong syllable, so that detecting the embedded word requires recombination of speech material across a segmentation point, while the latter sequence offers no such obstacles to embedded word detection as the non-initial syllable is weak and so the sequence is simply not divided.
    Similarly, when English speakers make slips of the ear that involve mistakes in word boundary placement, they tend most often to insert boundaries before strong syllables (e.g., hearing by loose analogy as by Luce and Allergy) or delete boundaries before weak syllables (e.g., hearing how big is it? as how bigoted?; Cutler & Butterfield, 1992).
    These findings prompted the proposal of the Metrical Segmentation Strategy for English (Cutler & Norris, 1988; Cutler, 1990), whereby listeners are assumed to segment speech at strong syllable onsets because they operate on the assumption, justified by distributional patterns in the input, that strong syllables are highly likely to signal the onset of lexical words. . . .
    Explicit segmentation has the strong theoretical advantage that it offers a solution to the word boundary problem both for the adult and for the infant listener. . . .
    «Together these strands of evidence motivate the claim that the explicit segmentation procedures used by adult listeners may in fact have their origin in the infant’s exploitation of rhythmic structure to solve the initial word boundary problem.»
    (Anne Cutler, «Prosody and the Word Boundary Problem.» Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition, ed. by James L. Morgan and Katherine Demuth. Lawrence Erlbaum, 1996)
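
To make the strategy concrete, here is a minimal sketch of the Metrical Segmentation Strategy heuristic. It is an illustration, not Cutler and Norris’s implementation: syllables arrive pre-marked as strong or weak, which assumes away the hard problem of deriving stress from the acoustic signal.

```python
# Toy Metrical Segmentation Strategy: hypothesize a word boundary at the
# onset of every strong (stressed) syllable. Syllables are (text, is_strong)
# pairs; a real system would have to infer stress from the acoustics.

def mss_segment(syllables):
    """Group syllables into candidate words, opening a new word at each
    strong syllable (except utterance-initially)."""
    words, current = [], []
    for text, is_strong in syllables:
        if is_strong and current:        # strong onset -> start a new word
            words.append("".join(current))
            current = []
        current.append(text)
    if current:
        words.append("".join(current))
    return words

# mint in [mɪn.tef] (strong-strong): the heuristic divides the sequence,
# breaking "mint" across a segmentation point, so detection is hard.
print(mss_segment([("mɪn", True), ("tef", True)]))   # ['mɪn', 'tef']

# mint in [mɪn.təf] (strong-weak): no division, so "mint" stays intact.
print(mss_segment([("mɪn", True), ("təf", False)]))  # ['mɪntəf']
```

The same boundary-before-strong-syllable logic reproduces the direction of the slips of the ear noted above: boundaries inserted before strong syllables and deleted before weak ones.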

The question relies on a number of unstated assumptions about word boundaries, assumptions that are not totally alien but also not obvious or obviously right. The main problem I see is the premise that there is one single thing, the word boundary, that solves myriad problems.

The notion of there being a single «phonological tree» seems to be historically based on importing notions of structure from syntax (we wanted phonology to be more like syntax), but the properties of tree-like representations as used in syllable and foot structure are not the same as those employed in syntactic representations: prosodic structure is not seriously recursive in the way that syntactic trees are, and phonological «trees» flout the single-mother convention. Attempting to align phonological grouping with morphosyntactic grouping just leads to tears, though that is not obvious if you consider only English. The problem is that combining a VC root with a VC prefix and a VC suffix (VC+VC+VC) typically yields the syllabification V.CV.CVC, i.e. V.C+V.C+VC, with syllable boundaries seriously misaligned with morpheme boundaries; the toy sketch below illustrates this.
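
A minimal sketch of that misalignment, under the standard assumption that syllabification maximizes onsets; the morphemes ap, at, and ak are hypothetical, chosen only for their VC shape:

```python
# Toy demonstration that onset-maximizing syllabification misaligns with
# morphology: hypothetical VC prefix + VC root + VC suffix (ap+at+ak)
# syllabifies as a.pa.tak, so internal syllable and morpheme boundaries
# never coincide.

VOWELS = set("aeiou")

def syllabify(word):
    """Onset-maximizing CV syllabifier (toy: no consonant clusters).
    A consonant before a vowel becomes an onset; a consonant not
    followed by a vowel becomes a coda."""
    sylls, i, n = [], 0, len(word)
    while i < n:
        syll = ""
        if word[i] not in VOWELS and i + 1 < n and word[i + 1] in VOWELS:
            syll += word[i]          # onset
            i += 1
        syll += word[i]              # nucleus (assumed to be a vowel)
        i += 1
        if i < n and word[i] not in VOWELS and (i + 1 == n or word[i + 1] not in VOWELS):
            syll += word[i]          # coda
            i += 1
        sylls.append(syll)
    return sylls

morphemes = ["ap", "at", "ak"]               # prefix + root + suffix
print(syllabify("".join(morphemes)))         # ['a', 'pa', 'tak']
```

The morpheme boundaries sit at ap|at|ak while the syllable boundaries sit at a|pa|tak: none of the internal boundaries line up.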

In English and in contrast to other languages such as Arabic, there is not much evidence for resyllabification between words, so prosodic and syntactic constituency are not generally at odds. At the level of affixation, we do have mismatches involving V-initial suffixes (invite [ɪn.ˈvajʔ], invitee [ɪn.vaj.ˈtʰi]), but not at the phrasal level in e.g. «invite Igor». In asking about word boundaries in «the big house», «motorcycle» or «What are you going to do?», you have to have a theory of entities (are there both word and syllable boundaries? Are there also morpheme boundaries?), and what those entities do for you. Are there necessary or sufficient criteria for diagnosing «.», «+» or «#»?

The reason for positing word boundaries is usually syntactic: «the» is a word, it occupies a certain syntactic position, same with «big». We might claim that «motorcycle» has an internal word boundary because «motor» and «cycle» are words, and neither can reasonably be called a prefix or suffix. Phonologically speaking, there is nothing about «motorcycle» that demands a word boundary.

Certain concatenations that can be lumped together under the rubric «contraction», for example «going to» → «gonna», «will not» → «won’t», «got you» → «gotcha», also «Harry’s», behave phonologically more like affixational structures, even though they are syntactically more like word combinations. Just positing a readjustment of boundaries (removing the «#») does not solve all of the problems, especially in negative inflections (my analytic prejudice is now revealed).

The final complication in analyzing the aforementioned concatenations is that boundaries are also invoked to account for some facts of speech rhythm. The two syllables of «lighthouse» have a fixed rhythmic organization (prominence on the first syllable), but the phrase «light house» has variable rhythm: prominence falls on «light» if you are shopping for a light house as opposed to a heavy house, and on «house» if the discussion is about a house that is light versus a hose that is light. Again, attempting to reduce these speech rhythm properties to nothing more than differences in word boundaries has proven to be futile. Once you introduce some other mechanism for encoding rhythmic distinctions, manipulation of word boundaries becomes unnecessary: we can just posit that word boundaries are there if and only if we syntactically concatenate two words. You still have to have an account of whether «won’t» is two syntactic words (as opposed to two syntactically mandated functions manifested within a single word).

In other words, manipulating word boundaries has not proven to be a useful method of analysis.

Cutler (1996) frames the problem this way in her abstract:

«The problem with word boundaries lies in locating them. In most spoken language, few cues are available to signal reliably where one word ends and the next begins. However, understanding spoken language must be a process of understanding discrete words rather than utterances as indivisible wholes, because most complete utterances have never previously been experienced by the listeners to whom they are directed. To understand a spoken utterance, therefore, listeners must somehow, in the absence…»

Among the works citing Cutler (1996) is «A Statistical Learning Algorithm for Word Segmentation» (Jerry R. Van Aken, arXiv, 2011), which describes a computer algorithm designed to locate word boundaries in blocks of English text from which the spaces have been removed; it relies entirely on statistical relationships between letters in the input stream to infer where the boundaries fall.
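
To give a flavor of how letter statistics alone can recover boundaries, here is a deliberately simple sketch. It is not Van Aken’s algorithm (his works without any spaced training text); this supervised cousin learns letter-pair statistics from text that still contains spaces and then reinserts boundaries into a de-spaced string.

```python
# Minimal letter-statistics segmenter (illustrative only, not Van Aken's
# method): learn how often each letter pair occurs inside a word versus
# across a space, then put a boundary wherever "across" beats "inside".
from collections import Counter

def train(corpus):
    inside, across = Counter(), Counter()
    for line in corpus:
        words = line.lower().split()
        for w in words:
            inside.update(zip(w, w[1:]))            # pairs within a word
        for w1, w2 in zip(words, words[1:]):
            across[(w1[-1], w2[0])] += 1            # pairs spanning a space
    return inside, across

def segment(text, inside, across):
    out = [text[0]]
    for a, b in zip(text, text[1:]):
        if across[(a, b)] > inside[(a, b)]:         # boundary more likely
            out.append(" ")
        out.append(b)
    return "".join(out)

corpus = ["the dog saw the cat", "the cat saw the dog"]
inside, across = train(corpus)
print(segment("thedogsawthecat", inside, across))   # "the dog saw the cat"
```

On realistic text, raw pair counts like these are far too sparse; anything practical needs longer contexts and smoothing, which is exactly where the statistical machinery of papers like Van Aken’s comes in.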



I have a master’s in Linguistics, but I’ve never gotten a satisfying answer to what a word objectively is. Wikipedia seems to suggest it’s mostly an artifact of writing systems, but I recall seeing it used a lot as a concept in various branches of linguistics, e.g. word-initial vs. word-final, analytic vs. synthetic, lexemes vs. morphemes, etc. It also seems to be pretty common to make a big deal out of languages with lots of agglutination, like the infamous Inuit words for snow or German’s reputation for long, intimidating words. So is there really a difference between «Friedensabkommen» and «peace agreement»? Is there any evidence for the existence of words? Can linguistics do without the concept, or would a lot of the standard frameworks fall apart?

As you read this sentence, it is easy to tell where one word ends and another begins — it is shown by the spaces between them. But what about in spoken language? Although pauses exist in speech, usually at the ends of sentences, the vast majority of oral communication occurs without stopping. Turn on the television or radio to a station in an unfamiliar language and listen. It will likely sound like a flood of miscellaneous noises running together. An acoustic signal, in fact, will show no pauses. However, to a native speaker, that continuous stream sounds perfectly understandable.

Certain oral clues help identify the separation between words. While adults may learn to pick out words that translate to meaningful terms in their own language, young children cannot depend on this method when learning their first language. Dr. Gaja Jarosz, assistant professor of linguistics at Yale, investigates the various types of cues used for word segmentation. Previous linguistic studies have artificially manipulated characteristics of a speech stream to investigate possible cues to which children are sensitive, but Jarosz seeks to discover to what extent these properties are present in natural language input, as well as how informative they are for children learning languages.

What is Computational Linguistics?

While researchers can take various approaches, including fieldwork and the study of language systems in explicit detail, the subfield of computational linguistics seems more akin to computer science than to any other so-called “softer” science. In particular, computational linguistics involves the creation of statistical and computer models to simulate certain aspects of language such as grammar or, in Jarosz’s case, word boundaries. It is the study of the “machinery,” as Jarosz describes it, “or the formal system, that underlies our knowledge of language … how we acquire it and how we process it.” This encompasses sound structure, syntax, combinations of words, and meanings of utterances.

In her most recent project, Jarosz and former Yale undergraduate J. Alex Johnson employed computational linguistics to look at various properties of speech when adults speak to children, because this is the input for children first acquiring language. Computer models helped to discover how well these signals could predict and identify word boundaries. By analyzing phonetic transcriptions of English, Polish, and Turkish, Jarosz compared the different cues present across these three languages and their involvement in learning.

Differences in Language

Deeming two languages similar or different depends on how they are compared. Jarosz chose to focus on Turkish, Polish, and English based on certain distinctive structural characteristics, particularly their morphology and syllable complexity. The morpheme is the basic unit of morphology, the grammatical structure of a language. For example, in English the morpheme “–s” distinguishes plural from singular (e.g. dog versus dogs), and the morpheme “–ed” marks the past tense (e.g. play vs. played), but these indicators vary across languages. Polish and Turkish, for example, both have more complicated morphologies than English, involving cases, verb endings, and other changes dependent on grammatical context, and Turkish has the most complicated morphology of the three. Because children first learning a language have to parse these complexities and determine the distinction between words and morphemes, it would hypothetically be more difficult to learn a morphologically complicated language like Turkish.

In contrast, Turkish has the simplest syllables, which should be easier to pronounce; Mandarin, Japanese, and most African languages are alike in this respect. English contains rather complex syllables, with multiple consonants at the beginnings or ends of words, while Polish, and Slavic languages in general, have even larger, more difficult ones.

Within languages, child-directed speech differs in important ways from adult-to-adult speech. The formation of vowel sounds, for example, can be plotted acoustically in a 2-D plane to portray “vowel space.” This plot is made by putting a measurement of how far forward the tongue is (from the front to the back of the mouth) on one axis and its height on the other. Jarosz explains, “When we talk to kids, the space gets stretched out more toward the extremes.” She studies this modified speech, as opposed to ordinary speech, because it is the specific input to which children are exposed during the language acquisition process.

[Figure: vowel sounds mapped by the position of the tongue in the mouth. Courtesy of Pamela Rogerson-Revell.]
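
Such a plot is easy to generate. The sketch below uses rough textbook-average formant values for four adult American English vowels; the numbers are illustrative placeholders, not data from Jarosz’s work, and the axes are inverted so that high front vowels appear at the top left, as in standard vowel charts.

```python
# Sketch of a vowel-space plot: F2 (correlating with tongue frontness)
# against F1 (correlating with tongue height). Formant values are rough
# averages for adult American English, used only for illustration.
import matplotlib.pyplot as plt

vowels = {               # vowel: (F1 Hz, F2 Hz), approximate
    "i":  (270, 2290),   # "beat"   : high front
    "ae": (660, 1720),   # "bat"    : low front
    "a":  (730, 1090),   # "father" : low back
    "u":  (300,  870),   # "boot"   : high back
}

fig, ax = plt.subplots()
for label, (f1, f2) in vowels.items():
    ax.annotate(label, (f2, f1), fontsize=14, ha="center")
ax.set_xlim(2500, 600)   # reversed: front vowels on the left
ax.set_ylim(900, 150)    # reversed: high vowels at the top
ax.set_xlabel("F2 (Hz), tongue frontness")
ax.set_ylabel("F1 (Hz), tongue height")
ax.set_title("Toy vowel space (approximate formant values)")
plt.show()
```

The stretching Jarosz describes for child-directed speech would show up in such a plot as the corner vowels moving farther apart.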

In spite of all these possible variations in morphology, syllable complexity, and other linguistic properties of languages, multilingual children seem to have no problem learning multiple languages, even if they are morphologically different. After a certain point, however, it becomes harder to learn a language fluently. Jarosz attributes this to what she calls “unlearning.” To illustrate this point, Jarosz pronounced a pair of very similar-sounding Polish syllables that young children can distinguish easily. “Children learning English after about a year or even less will learn not to distinguish them,” she says. “They will unlearn this difference and start to put them together in a single category.” In contrast, children learning Polish will maintain the ability to differentiate between the two sounds because they exist as two different categories in that language.

Distributional Cues

Across the three markedly different languages of English, Turkish, and Polish, Jarosz studied the predictive capability of 176 cues for word boundaries, such as stress patterns and transitional probability. Languages tend to put stresses near the edges of words: at the beginning, at the end, or on the second syllable from either edge. But children are not aware of this relationship when they first start learning languages. They have to learn, for instance, that the stressed syllable is at the beginning of the word in English but at the end in French. “The sorts of cues that children ultimately use to figure out where the word boundaries are create a kind of chicken and egg problem for the learner,” says Jarosz. While children need to know the word boundaries to know where the stress is, it seems that they also need to know where stress is to identify word boundaries.

Another boundary cue, the transitional probability, typically assesses the probability of seeing a certain phoneme (the basic, distinctive sound unit in language) given the previous one. For instance, what is the likelihood of seeing “o” next, given “d,” as in the word “dog”? The probability should be higher within words because the sounds/phonemes that make up the word always go together, but it should be lower at word boundaries since the next word can begin with any sound. Thus, dips in probability should predict the pattern of word segmentation. This property, among others, can also be calculated in either the forward or reverse direction, the latter instead looking at the likelihood of seeing a certain phoneme given the subsequent one. This time, given “o,” what is the chance of seeing “d” before it, as in “dog”?
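
A toy calculation makes the cue concrete. In the sketch below, letters of a made-up utterance stand in for phonemes (the example is invented, not from the study): pairs inside the two recurring “words” come out at probability 1.0, while pairs spanning a boundary dip lower.

```python
# Forward and backward transitional probabilities over a symbol stream,
# with boundary candidates at the dips. Letters stand in for phonemes.
from collections import Counter

def transitional_probs(stream, backward=False):
    """P(next | current), or P(previous | current) if backward=True."""
    pairs = list(zip(stream, stream[1:]))
    pair_counts = Counter(pairs)
    cond_counts = Counter(p[1] if backward else p[0] for p in pairs)
    return {(a, b): pair_counts[(a, b)] / cond_counts[b if backward else a]
            for (a, b) in pair_counts}

utterance = "dogcatdogdogcatcat"        # toy stream of "dog" and "cat"
tp = transitional_probs(utterance)
for (a, b), p in sorted(tp.items()):
    print(f"{a}->{b}: {p:.2f}")
# Within-word pairs (a->t, c->a, d->o, o->g) print 1.00; word-spanning
# pairs (g->c, g->d, t->c, t->d) print less, marking likely boundaries.
```

Passing backward=True gives the reverse-direction statistic described above: the likelihood of the preceding symbol given the current one.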

Across all three languages, Jarosz found that the best cues were calculated in the reverse direction, which corroborates recent experimental findings that infants can track this information. However, the cues were informative to different extents in the different languages. The greatest predictor in English was boundary-predicting backwards phoneme-level trigram probability — the probability that the previous phoneme would be an indicator of word boundary, given two subsequent phonemes, instead of one. However, the same trend was not evident in Polish or Turkish.
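
On one reading of that cue (Jarosz’s exact definition may differ, so treat this as an interpretive sketch), it can be estimated from boundary-annotated utterances as the probability that a word boundary immediately precedes a given pair of phonemes:

```python
# Interpretive sketch of a boundary-predicting backwards trigram cue:
# P(word boundary precedes position i | the two phonemes at i and i+1),
# estimated from utterances whose segmentation is known.
from collections import Counter

def train_backward_trigram(segmented_utterances):
    boundary, total = Counter(), Counter()
    for words in segmented_utterances:
        stream = "".join(words)
        starts, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            starts.add(pos)                  # index where a new word begins
        for i in range(len(stream) - 1):
            bigram = stream[i:i + 2]         # the two "subsequent" phonemes
            total[bigram] += 1
            if i in starts:
                boundary[bigram] += 1
    return {g: boundary[g] / total[g] for g in total}

model = train_backward_trigram([["the", "dog"], ["the", "cat"], ["a", "dog"]])
print(model["do"], model["he"])              # 1.0 0.0 in this toy corpus
```

The chicken-and-egg problem mentioned earlier applies here too: a learner cannot count boundary-annotated trigrams without already knowing some boundaries, which is one reason models often bootstrap such statistics from utterance edges, where boundaries come for free.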

Jarosz emphasizes the most important finding: while at least some individual cues in English are decently informative, no single one in Turkish or Polish is good by itself for learning about word segmentation. In other words, some of the 176 cues (stress, transitional probability, etc.) by themselves are somewhat useful in determining breaks between words in English, but none of them are individually helpful in the other two languages studied. When multiple cues are combined, though, such a clear advantage for English disappears. Thus, children must be paying attention to more than one word segmentation cue.
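
As a cartoon of what “combining cues” can mean, the sketch below z-scores two made-up cue tracks and averages them; real models weight cues far more carefully, and the numbers here are invented for illustration.

```python
# Combine several boundary cues into one score per between-phoneme
# position: z-score each cue so they are comparable, then average.
from statistics import mean, pstdev

def zscore(xs):
    m, s = mean(xs), pstdev(xs) or 1.0   # guard against zero variance
    return [(x - m) / s for x in xs]

def combine(*cues):
    """Average the z-scored cue values position by position."""
    return [mean(vals) for vals in zip(*(zscore(c) for c in cues))]

# Two invented cue tracks over six positions (higher = more boundary-like):
stress_cue = [0.1, 0.9, 0.2, 0.1, 0.8, 0.2]
tp_dip_cue = [0.3, 0.7, 0.4, 0.2, 0.9, 0.3]
print([round(s, 2) for s in combine(stress_cue, tp_dip_cue)])
# The combined score peaks at positions 1 and 4, where both cues agree.
```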

With this finding in mind, the next step is to figure out how children are integrating multiple cues to find the location of word boundaries in languages such as Turkish and Polish. Understanding this process is also a method to improve computational models of word segmentation for languages other than English. “A lot of models have been tested on English in particular but don’t work quite so well on other languages,” says Jarosz. “We have to make sure that we know how children do this in other languages as well — because obviously they do learn other languages, too!”

[Figure: a world map showing the locations of various language families. Courtesy of Freelang.]

About the Author
Nancy Huynh is a junior Molecular, Cellular, and Developmental Biology major in Silliman College. She works in Dr. Barbara Kazmierczak’s lab studying the impact of antibiotics on the gut microbiome and vaccination response.

Acknowledgments
The author would like to thank Professor Jarosz for taking the time to explain her linguistics research.
