Word boundaries: what they are

Word boundary
The boundary of a word; a word divider.

Concise Explanatory Dictionary of Printing.
2010.

The importance of recognizing word boundaries is illustrated by this advertisement from the County Down Spectator.

In writing, word boundaries are conventionally represented by spaces between words. In speech, word boundaries are determined in various ways, as discussed below.

Related Grammatical and Rhetorical Terms

  • Assimilation and Dissimilation
  • Conceptual Meaning
  • Connected Speech
  • Intonation
  • Metanalysis
  • Mondegreen
  • Morpheme and Phoneme
  • Oronyms
  • Pause
  • Phonetics and Phonology
  • Phonological Word
  • Prosody
  • Segment and Suprasegmental
  • Slip of the Ear
  • Sound Change

Examples of Word Boundaries

  • «When I was very young, my mother scolded me for flatulating by saying, ‘Johnny, who made an odor?’ I misheard her euphemism as ‘who made a motor?’ For days I ran around the house amusing myself with those delicious words.» (John B. Lee, Building Bicycles in the Dark: A Practical Guide on How to Write. Black Moss Press, 2001)
  • «I could have sworn I heard on the news that the Chinese were producing new trombones. No, it was neutron bombs.» (Doug Stone, quoted by Rosemarie Jarski in Dim Wit: The Funniest, Stupidest Things Ever Said. Ebury, 2008)
  • «As far as input processing is concerned, we may also recognize slips of the ear, as when we start to hear a particular sequence and then realize that we have misperceived it in some way; e.g. perceiving the ambulance at the start of the yam balanced delicately on the top . . ..» (Michael Garman, Psycholinguistics. Cambridge University Press, 2000)

Word Recognition

  • «The usual criterion for word recognition is that suggested by the linguist Leonard Bloomfield, who defined a word as ‘a minimal free form.’ . . .
  • «The concept of a word as ‘a minimal free form’ suggests two important things about words. First, their ability to stand on their own as isolates. This is reflected in the space which surrounds a word in its orthographical form. And secondly, their internal integrity, or cohesion, as units. If we move a word around in a sentence, whether spoken or written, we have to move the whole word or none of it—we cannot move part of a word.»
    (Geoffrey Finch, Linguistic Terms and Concepts. Palgrave Macmillan, 2000)
  • «[T]he great majority of English nouns begins with a stressed syllable. Listeners use this expectation about the structure of English and partition the continuous speech stream employing stressed syllables.»
    (Z.S. Bond, «Slips of the Ear.» The Handbook of Speech Perception, ed. by David Pisoni and Robert Remez. Wiley-Blackwell, 2005)

Tests of Word Identification

  • Potential pause: Say a sentence out loud, and ask someone to ‘repeat it very slowly, with pauses.’ The pauses will tend to fall between words, and not within words. For example, the / three / little / pigs / went / to / market. . . .
  • Indivisibility: Say a sentence out loud, and ask someone to ‘add extra words’ to it. The extra item will be added between the words and not within them. For example, the pig went to market might become the big pig once went straight to the market. . . .
  • Phonetic boundaries: It is sometimes possible to tell from the sound of a word where it begins or ends. In Welsh, for example, long words generally have their stress on the penultimate syllable . . .. But there are many exceptions to such rules.
  • Semantic units: In the sentence Dog bites vicar, there are plainly three units of meaning, and each unit corresponds to a word. But language is often not as neat as this. In I switched on the light, the has little clear ‘meaning,’ and the single action of ‘switching on’ involves two words.
    (Adapted from The Cambridge Encyclopedia of Language, 3rd ed., by David Crystal. Cambridge University Press, 2010)

Explicit Segmentation

  • «[E]xperiments in English have suggested that listeners segment speech at strong syllable onsets. For example, finding a real word in a spoken nonsense sequence is hard if the word is spread over two strong syllables (e.g., mint in [mɪntef]) but easier if the word is spread over a strong and a following weak syllable (e.g., mint in [mɪntəf]; Cutler & Norris, 1988).
    The proposed explanation for this is that listeners divide the former sequence at the onset of the second strong syllable, so that detecting the embedded word requires recombination of speech material across a segmentation point, while the latter sequence offers no such obstacles to embedded word detection as the non-initial syllable is weak and so the sequence is simply not divided.
    Similarly, when English speakers make slips of the ear that involve mistakes in word boundary placement, they tend most often to insert boundaries before strong syllables (e.g., hearing by loose analogy as by Luce and Allergy) or delete boundaries before weak syllables (e.g., hearing how big is it? as how bigoted?; Cutler & Butterfield, 1992).
    These findings prompted the proposal of the Metrical Segmentation Strategy for English (Cutler & Norris, 1988; Cutler, 1990), whereby listeners are assumed to segment speech at strong syllable onsets because they operate on the assumption, justified by distributional patterns in the input, that strong syllables are highly likely to signal the onset of lexical words. . . .
    Explicit segmentation has the strong theoretical advantage that it offers a solution to the word boundary problem both for the adult and for the infant listener. . . .
    «Together these strands of evidence motivate the claim that the explicit segmentation procedures used by adult listeners may in fact have their origin in the infant’s exploitation of rhythmic structure to solve the initial word boundary problem.»
    (Anne Cutler, «Prosody and the Word Boundary Problem.» Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition, ed. by James L. Morgan and Katherine Demuth. Lawrence Erlbaum, 1996)

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I’m sure there was a good reason for it at the time).

The \w stands for «word character». It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b: the period, plus, and hash are not word characters, so \b cannot match next to them unless a word character sits on the other side.

Note: I’m not sure what to do about mistakes in text, like when someone doesn’t put a space after a period at the end of a sentence. I allowed for it, but I’m not sure that it’s necessarily the right thing to do.

Anyway, in Java, if you’re searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns every line of the input that contains a match for regexp, one per line.
public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

Then in your test or main function:

    // Regex backslashes are doubled because these are Java string literals.
    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\"|^)";
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    // Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
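A possible variation on the same idea, offered only as a sketch and not as the code above: negative lookarounds can check the neighbouring characters without consuming them, and Pattern.quote removes the need to escape the period and pluses by hand. The names ExactTermSearch, exactTerm, containsTerm, and DELIMS are hypothetical; the delimiter set is copied from beforeWord and afterWord.

```java
import java.util.regex.Pattern;

class ExactTermSearch {
    // Same delimiters as beforeWord/afterWord, written as the inside of a character class.
    private static final String DELIMS = "\\s.,!?()'\"";

    // Builds a pattern that matches `term` only when each neighbouring character
    // is a delimiter or the edge of the input.
    static Pattern exactTerm(String term) {
        String leftOk = "(?<![^" + DELIMS + "])";  // not preceded by a non-delimiter
        String rightOk = "(?![^" + DELIMS + "])";  // not followed by a non-delimiter
        return Pattern.compile(leftOk + Pattern.quote(term) + rightOk);
    }

    static boolean containsTerm(String text, String term) {
        return exactTerm(term).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        for (String term : new String[] {".NET", "C#", "C++", "C"}) {
            System.out.println(term + " found: " + containsTerm(text, term));
        }
        // The "&" after "C" is not in the delimiter set, so this is not reported.
        System.out.println("C in C&O found: " + containsTerm("Worked on C&O Canal", "C"));
    }
}
```

Because the lookarounds are zero-length, the match itself covers only the term, whereas the group-based version also consumes the delimiter on each side.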

P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
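As a minimal sketch of a “whole words only” search in Java (the class and method names here are hypothetical, and Pattern.quote is used so that the searched word is always taken literally):

```java
import java.util.regex.Pattern;

class WholeWordSearch {
    // True if `word` occurs in `text` with a word boundary on both sides.
    static boolean containsWholeWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWholeWord("a cat sat", "cat"));    // "cat" stands alone
        System.out.println(containsWholeWord("concatenate", "cat"));  // "cat" is embedded
    }
}
```

This only behaves as a whole-word search when the word itself starts and ends with a word character.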

Exactly which characters are word characters depends on the regex flavor you’re working with. In most flavors, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w.

Most flavors, except the ones discussed below, have only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.

Since digits are considered to be word characters, \b4\b can be used to match a 4 that is not part of a larger number. This regex does not match 44 sheets of a4. So saying “\b matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.
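The digit case can be checked with a short Java sketch; the class and method names are hypothetical and the sample strings other than 44 sheets of a4 are made up for illustration:

```java
import java.util.regex.Pattern;

class StandaloneFour {
    // True if the text contains a 4 that is not glued to another word character.
    static boolean hasStandaloneFour(String text) {
        return Pattern.compile("\\b4\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasStandaloneFour("44 sheets of a4")); // no standalone 4
        System.out.println(hasStandaloneFour("4 sheets of a4"));  // leading 4 stands alone
    }
}
```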

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
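To make the positions concrete, here is a small Java sketch that lists every index at which a zero-length pattern matches. The class name, the positions helper, and the sample string "ab, c" are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Collects the start index of every match of `regex` in `text`.
    static List<Integer> positions(String regex, String text) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ab, c";
        System.out.println("\\b at " + positions("\\b", text)); // [0, 2, 4, 5]
        System.out.println("\\B at " + positions("\\B", text)); // [1, 3]
    }
}
```

Together the two lists cover every position from 0 to the length of the string exactly once, which is what “negated version” means here.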

Looking Inside The Regex Engine

Let’s see what happens when we apply the regex \bis\b to the string This island is beautiful. The engine starts with the first token \b at the first character T. Since this token is zero-length, the position before the character is inspected. \b matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal i. The engine does not advance to the next character in the string, because the previous regex token was zero-length. i does not match T, so the engine retries the first token at the next character position.

\b cannot match at the position between the T and the h. It cannot match between the h and the i either, and neither between the i and the s.

The next character in the string is a space. \b matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the i which does not match with the space.

Advancing a character and restarting with the first regex token, \b matches between the space and the second i in the string. Continuing, the regex engine finds that i matches i and s matches s. Now, the engine tries to match the second \b at the position before the l. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the s in island. Again, the \b fails to match and continues to do so until the second space is reached. It matches there, but matching the i fails.

But \b matches at the position before the third i in the string. The engine continues, and finds that i matches i and s matches s. The last token in the regex, \b, also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression is, it would have matched the is in This.
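The outcome of this walk-through can be reproduced with a short Java sketch; the class and method names are hypothetical, and indexes are zero-based offsets into the string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatch {
    // Returns the start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "This island is beautiful";
        System.out.println(firstMatchStart("\\bis\\b", text)); // 12, the standalone word
        System.out.println(firstMatchStart("is", text));       // 2, inside "This"
    }
}
```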

Tcl Word Boundaries

Word boundaries, as described above, are supported by most regular expression flavors. Notable exceptions are the POSIX and XML Schema flavors, which don’t support word boundaries at all. Tcl uses a different syntax.

In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl’s). \B matches a single backslash character in Tcl, just like \\ in all other regex flavors (and Tcl too).

Tcl uses the letter “y” instead of the letter “b” to match word boundaries. \y matches at any word boundary position, while \Y matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in Perl-style regex flavors. They don’t discriminate between the start and the end of a word.

Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also matches at the end of the string if the last character in the string is a word character.

The only regex engine that supports Tcl-style word boundaries (besides Tcl itself) is the JGsoft engine. In PowerGREP and EditPad Pro, \b and \B are Perl-style word boundaries, while \y, \Y, \m and \M are Tcl-style word boundaries.

In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds “whole words only” occurrences of “word” just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after \y, you can easily specify in the regex whether these characters should be word characters or non-word characters. If you want to match any word, \y\w+\y gives the same result as \m.+\M. Using \w instead of the dot automatically restricts the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+\y would not work. This regex matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal optimizations.

If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl’s \m and (?<=\w)(?!\w) to emulate \M. Though quite a bit more verbose, these lookaround constructs match exactly the same as Tcl’s word boundaries.

If your flavor has lookahead but not lookbehind, and also has Perl-style word boundaries, you can use \b(?=\w) to emulate Tcl’s \m and \b(?!\w) to emulate \M. \b matches at the start or end of a word, and the lookahead checks if the next character is part of a word or not. If it is we’re at the start of a word. Otherwise, we’re at the end of a word.
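Java has no start-of-word or end-of-word tokens of its own, so both emulations can be tried there. The following is a sketch only: the class name, the constant names, and the sample string "ab, c" are hypothetical, and the input is plain ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEdges {
    // Emulations of start-of-word and end-of-word using lookbehind and lookahead.
    static final String START_LOOKAROUND = "(?<!\\w)(?=\\w)";
    static final String END_LOOKAROUND = "(?<=\\w)(?!\\w)";
    // Emulations that need only lookahead plus a Perl-style word boundary.
    static final String START_BOUNDARY = "\\b(?=\\w)";
    static final String END_BOUNDARY = "\\b(?!\\w)";

    // Collects the index of every zero-length match of `regex` in `text`.
    static List<Integer> positions(String regex, String text) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ab, c";
        System.out.println("word starts: " + positions(START_LOOKAROUND, text)); // [0, 4]
        System.out.println("word ends:   " + positions(END_LOOKAROUND, text));   // [2, 5]
        System.out.println(positions(START_BOUNDARY, text).equals(positions(START_LOOKAROUND, text)));
        System.out.println(positions(END_BOUNDARY, text).equals(positions(END_LOOKAROUND, text)));
    }
}
```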

GNU Word Boundaries

The GNU extensions to POSIX regular expressions add support for the \b and \B word boundaries, as described above. GNU also uses its own syntax for start-of-word and end-of-word boundaries. \< matches at the start of a word, like Tcl’s \m. \> matches at the end of a word, like Tcl’s \M.

Boost also treats \< and \> as word boundaries when using the ECMAScript, extended, egrep, or awk grammar.

POSIX Word Boundaries

The POSIX standard defines [[:<:]] as a start-of-word boundary, and [[:>:]] as an end-of-word boundary. Though the syntax is borrowed from POSIX bracket expressions, these tokens are word boundaries that have nothing to do with and cannot be used inside character classes. Tcl and GNU also support POSIX word boundaries. PCRE supports POSIX word boundaries starting with version 8.34. Boost supports them in all its grammars.
