Key word in context

From Wikipedia, the free encyclopedia

Key Word In Context (KWIC) is the most common format for concordance lines. The term KWIC was first coined by Hans Peter Luhn.[1] The system was based on a concept called keyword in titles which was first proposed for Manchester libraries in 1864 by Andrea Crestadoro.[2]

A KWIC index is formed by sorting and aligning the words within an article title to allow each word (except the stop words) in titles to be searchable alphabetically in the index.[3] It was a useful indexing method for technical manuals before computerized full text search became common.
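The sorting-and-aligning step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the stop-word list, the `kwic_index` name, and the 30-character left column are all assumptions made for the example.

```python
# Hypothetical stop-word list; real indexes use much longer ones.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}

def kwic_index(titles, width=30):
    """Return aligned KWIC lines, one per non-stop word of each title."""
    rows = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])    # context before the keyword
            right = " ".join(words[i:])   # keyword plus following context
            rows.append((word.lower(), left, right))
    rows.sort()                           # alphabetical by keyword
    # Right-align the left context so every keyword starts in the same column.
    return [f"{left[-width:]:>{width}}  {right}" for _, left, right in rows]

for line in kwic_index(["The Old Man and the Sea", "A Tale of Two Cities"]):
    print(line)
```

Each title appears once per significant word, and the keywords line up in a single column that can be scanned alphabetically.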

For example, a search query including all of the words in an example definition («KWIC is an acronym for Key Word In Context, the most common format for concordance lines») and the Wikipedia slogan in English («the free encyclopedia»), searched against a Wikipedia page, might yield a KWIC index as follows. A KWIC index usually uses a wide layout to allow the display of maximum ‘in context’ information (not shown in the following example).

                     KWIC is an  acronym for Key Word In Context, …  page 1
… Key Word In Context, the most  common format for concordance lines.  page 1
   … the most common format for  concordance lines.  page 1
… is an acronym for Key Word In  Context, the most common format …  page 1
            Wikipedia, The Free  Encyclopedia  page 0
  … In Context, the most common  format for concordance lines.  page 1
                 Wikipedia, The  Free Encyclopedia  page 0
         KWIC is an acronym for  Key Word In Context, the most …  page 1
                                 KWIC is an acronym for Key Word …  page 1
… common format for concordance  lines.  page 1
 … for Key Word In Context, the  most common format for concordance …  page 1
                                 Wikipedia, The Free Encyclopedia  page 0
     KWIC is an acronym for Key  Word In Context, the most common …  page 1

A KWIC index is a special case of a permuted index.[4] This term refers to the fact that it indexes all cyclic permutations of the headings. Books composed of many short sections with their own descriptive headings, most notably collections of manual pages, often ended with a permuted index section, allowing the reader to easily find a section by any word from its heading. This practice, also known as Key Word Out of Context (KWOC), is no longer common.
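As a rough sketch of what "all cyclic permutations of the headings" means, the following Python snippet rotates each heading word by word and sorts the rotations. The function names and the sample heading are illustrative assumptions; real tools such as ptx also handle stop words and layout.

```python
def rotations(heading):
    """All cyclic rotations of a heading, one starting at each word."""
    words = heading.split()
    return [" ".join(words[i:] + words[:i]) for i in range(len(words))]

def permuted_index(headings):
    """Sort every rotation alphabetically, keeping a pointer to its heading."""
    entries = [(rot.lower(), rot, h) for h in headings for rot in rotations(h)]
    return [(rot, h) for _, rot, h in sorted(entries)]

for rotation, heading in permuted_index(["list directory contents"]):
    print(f"{rotation:<30} -> {heading}")
```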

KWIC analysis is also used in academic research. In one study of comments on architectural heritage YouTube videos, the KWIC result was taken to reflect the cognitive-emotional image perceived by the commenters. The study counted how often contextual words appeared on the right- and left-hand sides of French keywords, and computed a score indicating the correlation between those keywords and the contextual French words.[5]

  • Keyword alongside context (KWAC)

  • Keyword in context (KWIC)

  • Keyword out of context (KWOC)

References in literature

Note: The first reference does not show the KWIC index unless you pay to view the paper. The second reference does not even list the paper at all.

  • David L. Parnas uses a KWIC index as an example of how to perform modular design in his paper «On the Criteria To Be Used in Decomposing Systems into Modules», available as an ACM Classic Paper.
  • Christopher D. Manning and Hinrich Schütze describe a KWIC index and computer concordancing in section 1.4.5 of their book Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press, 1999. ISBN 9780262133609. They cite a 1960 article by H. P. Luhn, «Key word-in-context index for technical literature (kwic index)».
  • According to Rev. Gerard O’Connor’s Concordantia et Indices Missalium Romanorum, «Most of the concordances produced in recent times and with the aid of computer software use both the KWIC (keyword in context) and KWICn (keyword in center) formats, which lists the keyword, usually highlighted in bold text in a consistent position, within a limited amount of context text, i.e. three [or] four words of the text prior to the keyword and the same amount of text following. This format is extremely useful in that the keyword is easily identified together with its context. … The Concordance of the Roman Missal is produced in both the KWIC and KWICn formats and is noteworthy in that each word form is listed as it appears in the text, that is, it is un-lemmatized.»

See also

  • ptx, a Unix command-line utility producing a permuted index
  • Concordancer
  • Concordance (publishing)
  • Burrows–Wheeler transform
  • Hans Peter Luhn
  • Suffix tree

References

  1. ^ Manning, C. D., Schütze, H.: «Foundations of Statistical Natural Language Processing», p. 35. The MIT Press, 1999
  2. ^ «Advanced Indexing and Abstracting Practices». Atlantic Publishers & Distri. Retrieved 26 March 2019 – via Google Books.
  3. ^ «KWIC indexes and concordances». Archived from the original on 2016-06-06. Retrieved 2016-06-17.
  4. ^ «3. Theory of KWIC indexing». Infohost.nmt.edu. Retrieved 26 March 2019.
  5. ^ Song et al (2023) The Cultivation Effect of Architectural Heritage YouTube Videos on Perceived Destination Image, Buildings 2023, 13(2), 508; https://doi.org/10.3390/buildings13020508



  • KWIC
    • Key Words In Context and Concordances
    • Objective of the kwic module
    • Usage
      • Three Steps
      • Single Step
      • Unicode Normalization
    • Relationship to Hollerith, the Binary Phrase DB
    • Related Software

KWIC

Key Words In Context and Concordances

Keywords In Context (KWIC) is a technique for producing indexes based on rotational
permutations of the indexed linguistic material. According to Wikipedia:

KWIC is an acronym for Key Word In Context, the most common format for
concordance lines. The term KWIC was first coined by Hans Peter Luhn.[1] The
system was based on a concept called keyword in titles which was first
proposed for Manchester libraries in 1864 by Andrea Crestadoro.[2]

The best way to understand what the KWIC technique is all about is to skim
through the pages of a classical KWIC index, of which Computer Literature
Bibliography: 1946 to 1963 is one example. Here’s page 129 from that 1965 book:

[Image: Computer Literature Bibliography: 1946 to 1963, page 129]

This title word index goes on for over 300 pages. In the center of each page is
the current keyword, from A.C.E. and ABACUS through COMPUTER to ZURICH,
followed by numbers representing years and machine models. Each line represents
the title of a book or article, and each keyword is surrounded by the words that
appear in the referenced title, in the order they appear there. At the end of
each line, we find an abbreviation that identifies the publication and the page
number where each title is to be found.

(By contrast, other indexing methods are known to shuffle words around, as in
Man and the Sea, The Old, in order to make the relevant item appear in the
right place in the catalog. For the sake of space efficiency, long titles are
wrapped around here, too; however, this does not affect the sorting order.)

The benefits of the KWIC approach to indexing are immediately obvious: instead
of having to guess where the librarian chose to place the index card for that
edition of The Old Man and the Sea you’re looking for (Sea? Old? Man?),
you can look it up under each ‘content’ word. Also, you’ll likely have an easier
time finding works about related subjects where those share title words with
your particular search.

What’s more, you get a collocational analysis of sorts for free, that is, given
a comprehensive KWIC index covering titles (and maybe full texts) of a given
field, you can gain an idea of what words go with which ones. The index as shown
above is admittedly much better at showing occurrences of type I+S (where I
is what you searched for, call it the infix, and S is what follows, call it
the suffix) than for those of type P+I (where P is what precedes the
infix, call it the prefix); this becomes clear when you compare the entries
near COMPUTER LANGUAGE from the picture above with the entries on pages 225f.
of the same work: to the naked eye, the prefixes you might be interested in
(say, COMPUTER, PROGRAMMING or ALGORITHMIC on the above page) are rather
haphazardly strewn across the place, although some clusters do seem to occur
(this, by the way, is a weakness of this particular index that we will address
and try to remedy a little further down).

The main downside of KWIC indexes is also apparent from the Bibliography:
Whereas the register (where all the abbreviations of cited publications are
listed) takes up roughly 70 and the author index roughly 80 pages, the KWIC
index as such weighs in with 307 pages, meaning each title appears around 4.5
times on average. This can hardly be otherwise given that the very objective of
the KWIC index is exactly to file each title under each relevant term; however,
it also helps to explain why printed KWIC indexes had, for the most part, to
wait for computers to arrive and went out of fashion as soon as computers became
capable of delivering documents online as well. Similarly, concordances were
only done for subjects and keywords deemed worthy of the tremendous effort in
terms of time and paper.

Objective of the kwic module

kwic is a NodeJS module; as such, you can install it
with npm install kwic. When you then do node lib/demo.js, you
will be greeted with the following output:

                    |a         a                #  1
                   b|a         ba               #  2
                  cb|a         cba              #  3
                   c|a         ca               #  4
                  bc|a         bca              #  5
                    |ab        ab               #  6
                   c|ab        cab              #  7
                    |abc       abc              #  8
                   c|abdriver  cabdriver        #  9
                   c|abs       cabs             # 10
                    |ac        ac               # 11
                   b|ac        bac              # 12
                    |acb       acb              # 13
                   c|ad        cad              # 14
                    |b         b                # 15
                   a|b         ab               # 16
                  ca|b         cab              # 17
                   c|b         cb               # 18
                  ac|b         acb              # 19
                    |ba        ba               # 20
                   c|ba        cba              # 21
                    |bac       bac              # 22
                    |bc        bc               # 23
                   a|bc        abc              # 24
                    |bca       bca              # 25
                  ca|bdriver   cabdriver        # 26
                  ca|bs        cabs             # 27
                    |c         c                # 28
                   a|c         ac               # 29
                  ba|c         bac              # 30
                   b|c         bc               # 31
                  ab|c         abc              # 32
                    |ca        ca               # 33
                   b|ca        bca              # 34
                    |cab       cab              # 35
                    |cabdriver cabdriver        # 36
                    |cabs      cabs             # 37
                    |cad       cad              # 38
                    |cb        cb               # 39
                   a|cb        acb              # 40
                    |cba       cba              # 41
                  ca|d         cad              # 42
                 cab|driver    cabdriver        # 43
             cabdriv|er        cabdriver        # 44
               cabdr|iver      cabdriver        # 45
            cabdrive|r         cabdriver        # 46
                cabd|river     cabdriver        # 47
                 cab|s         cabs             # 48
              cabdri|ver       cabdriver        # 49

The above is a KWIC-style permuted index of these (artificial and real) ‘words’,
chosen to highlight some characteristics of the implemented algorithm:

a           ba          cab
ab          bac         cabdriver
abc         bc          cabs
ac          bca         cad
acb         c           cb
b           ca          cba

First of all, the demo shows how to index words by their constituent letters
(and not phrases by their constituent words, as the classical exemplar does);
this is related to the particular intended use case, but configurable.

Next, there’s a vertical line in the output shown: this line indicates the
separation between what was called above the prefix and the infix, with the
suffix starting at the next position after the infix. Now when you read from
top to bottom along said line, you will observe that

(1) all the infixes are listed in alphabetical order (actually, in a simplified
version of that, Unicode lexicographical order);

(2) all the suffixes, likewise, are in alphabetical order, so that

(3) all the co-occurrences of a given infix with all subsequent suffixes (trailing
letters in this case) are always neatly clustered. For example, all occurrences
of |ca... (infix c plus all the suffixes starting with an a) are
found on lines #33 through #38 in the above output and nowhere else. The
inverse also holds: wherever the sequence c, a occurs in the listing, it is
always a duplicate of one entry in said range, indexed by another letter.

(4) Wherever a new group of a given infix (index letter) starts, the sole
letter always comes first, followed by all those entries that end in
that letter; this is a corollary of the previous points (and happens to be
in agreement with how the Bibliography treats this case).

(5) After the words that end with the index letter come the ones that start
with that letter, short ones with letters early in the alphabet (a, b, …)
occurring first.

(6) These in turn (and now it gets interesting) are interspersed with those words
that contain the infix and are preceded by one or more letters, and here
the rule is again that short words and early letters sort first, but in the
prefix, power of ordering counts backwards from the infix at the right
down to the start of the prefix on the left. The effect is that for any
given run of a common infix and suffix, same letters to the left of the
vertical line have a tendency to form secondary clusters.

Prefixes cannot possibly all cluster together as long as we stick to
granting the suffix priority in sorting; after all, a list of items still
has only a single dimension and, hence, neighborhood has only two
positions. This is why you see c|b and ac|b right next to each other,
but c|ba, which also has the sequence c|b, is separated by |ba (and
had we included, say, bank and bar, those would likewise intervene).
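The ordering rules above can be reproduced with a short Python sketch. This is not the module's own code: the `sort_key` helper, the use of `-1` as a stand-in for the smallest-sorting ‘null’ marker, and plain code points as weights are assumptions chosen so that the result can be compared with the demo listing.

```python
WORDS = ("a ab abc ac acb b ba bac bc bca c ca cab "
         "cabdriver cabs cad cb cba").split()

NULL = -1  # stand-in for a marker that sorts before every real weight

def sort_key(word, i):
    """Key for the entry of `word` indexed at letter position `i`:
    infix and suffix first, then NULL, then the prefix read backwards."""
    weights = [ord(ch) for ch in word]
    return tuple(weights[i:]) + (NULL,) + tuple(reversed(weights[:i]))

def permuted_index(words):
    rows = [(sort_key(w, i), w[:i], w[i:]) for w in words for i in range(len(w))]
    rows.sort()
    return [f"{prefix}|{rest}" for _, prefix, rest in rows]

for number, row in enumerate(permuted_index(WORDS), start=1):
    print(f"{row:<12} # {number}")
```

Because the suffix is compared before the reversed prefix, entries sharing an infix and suffix end up adjacent, and within such a run the letters immediately left of the vertical line decide the order.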

Usage

Three Steps

You can use KWIC in three small steps or in a single step; the first way
is probably better when making yourself comfortable with KWIC, to find out where
things went wrong or to modify intermediate data. The single-step API is more
convenient for production and discards unneeded intermediate data. In both
cases, the objective is to input some entries and get out a number of
data structures (called the ‘permutations’) that can be readily used in
conjunction with the Hollerith CoDec and a LevelDB instance to produce a
properly sorted KWIC index.

Let’s start with the ‘slow’ API. The first thing you do is to compile a list of
entries (e.g. words) and prepare an empty list (call it collection) to hold
the permuted keys. Assuming you start with a (somewhat normalized) text and use
the unique_words_from_text function as found in src/demo.coffee, the steps
from raw data to output look like this:

entries     = unique_words_from_text text                       # 1
collection  = []                                                # 2
for entry in entries                                            # 3
  factors       = KWIC.get_factors      entry                   # 4
  weights       = KWIC.get_weights      factors                 # 5
  permutations  = KWIC.get_permutations factors, weights        # 6
  collection.push [ permutations, entry, ]                      # 7
  # does nothing, just in case you want to know                 # 8
  for permutation in permutations                               # 9
    [ r_weights, infix, suffix, prefix, ] = permutation         # 10
KWIC.report collection                                          # 11

In the first step, each entry that you iterate over gets split into a list of
‘factors’. Each factor represents what is essentially treated as a unit by the
algorithm; that could be Unicode characters (codepoints), or stretches of
letters representing syllables, morphemes, or orthographic words; this will
depend on your use case.

For ease of presentation, the default of KWIC.get_factors is to split each
given string into characters. If you want something different, you may specify a
factorizer as second argument which should be a function (or the name of a
registered factorizer method) that accepts an entry string and returns a list of
strings derived from that input. For commonly occurring cases, a number of named
factorizers is included as KWIC.factorizers.
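To make the idea of a pluggable factorizer concrete, here is a Python analogue. It is only an illustration of the pattern described above; the registry contents and the names `characters` and `words` are hypothetical and are not taken from KWIC.factorizers.

```python
def characters(entry):
    """Default: every character is a factor."""
    return list(entry)

def words(entry):
    """Alternative: every whitespace-separated word is a factor."""
    return entry.split()

# Hypothetical registry of named factorizers.
FACTORIZERS = {"characters": characters, "words": words}

def get_factors(entry, factorizer="characters"):
    """Accept either a registered name or any function str -> list of str."""
    if isinstance(factorizer, str):
        factorizer = FACTORIZERS[factorizer]
    return factorizer(entry)

print(get_factors("cabs"))                          # split into letters
print(get_factors("the old man", "words"))          # split into words
print(get_factors("a-b-c", lambda e: e.split("-"))) # custom function
```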

In the second step, each list of factors gets turned into a list of weights.
Weights are what will be used for sorting; typically, these are non-negative
integer numbers, but you could use anything that can be sorted, such as rational
numbers, negative numbers, lists of numbers or, in fact, strings. As with
get_factors, get_weights accepts a second argument, called alphabet in
this case, which may be the name of one of the standard alphabets registered in
KWIC.alphabets, a function that returns a weight when called with a factor, or
a list that enumerates all possible factors (and whose indices will become
weights). The general rule is that wherever a given weight is smaller than
another one, the first will sort before the second, and vice versa.
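The three kinds of alphabet argument can be sketched in Python as follows. Again this is an analogue written for illustration, not the module's implementation; the default of using code points mirrors the weights shown for cabs, and the `german` weighting function is an invented example.

```python
def get_weights(factors, alphabet=None):
    """Map factors to sortable weights.

    alphabet may be None (use the code point of single-character factors),
    a function factor -> weight, or a list whose indices become the weights.
    """
    if alphabet is None:
        return [ord(factor) for factor in factors]
    if callable(alphabet):
        return [alphabet(factor) for factor in factors]
    return [alphabet.index(factor) for factor in factors]

# Invented example: treat German ä as 'a kind of a'.
def german(factor):
    return ord("a") if factor == "ä" else ord(factor)

print(get_weights(list("cabs")))                  # code points
print(get_weights(list("cab"), ["a", "b", "c"]))  # list indices
print(get_weights(list("bär"), german))           # custom function
```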

The essential part happens with the third call, permutations = KWIC.get_permutations factors, weights. You can treat the return value as a
black box; the idea is to push a list with the permutations as the first
and whatever identifies your entry as the second element to your collection
in order to prepare for sorting and output. When you’re done with collecting the
entries, you can pass the collection to KWIC.report, which will then sort
and print the result. If you need any kind of further data to point from each
index row back into, say, page and line numbers where those entries originated,
you’ll have to organize that part yourself (you could make each ‘entry’ an
object with the pertinent data attached to it and use a custom factorization
method).

In case you’re interested, the permutations list will contain as many
sub-lists as there were factors, one for each occurrence of the entry in the
index. Each sub-list starts with an item internally called the r-weights, that
is, a rotated list of weights. Let’s look at the permutations for the entry
cabs:

factors = [ 'c', 'a', 'b', 's', ]                           # 4
weights = [  99,  97,  98, 115, ]                           # 5

permutations = [                                            # 6
  [ [  99,    97,    98,   115,  null, ], 'c', [ 'a', 'b', 's', ], [                ], ]
  [ [  97,    98,   115,  null,    99, ], 'a', [ 'b', 's',      ], [ 'c',           ], ]
  [ [  98,   115,  null,    97,    99, ], 'b', [ 's',           ], [ 'c', 'a',      ], ]
  [ [ 115,  null,    98,    97,    99, ], 's', [                ], [ 'c', 'a', 'b', ], ]
  ]

The line marked #4 shows the factors of the word cabs, which are simply its
letters or characters. Line #5 shows the weights list, which, again, is just
each character’s Unicode codepoint expressed as a decimal number (in this simple
example, we could obviously just sort using Unicode strings, but
that simplicity breaks down as soon as a more
locale-specific sorting is needed, such as treating German ä either as ‘a kind
of a’ or as ‘a kind of ae’).

Now the first item in each of the four sub-lists of the permutations (lines marked #6)
contains the ‘rotated weights’ mentioned above. In fact, those weights are not only rotated,
they

  • are extended with a ‘null’ value (which, being (almost) the smallest possible value
    when Hollerith CoDec-encoded,
    will sort before anything else);

  • contain the weights for the prefixes in reversed order; replacing the
    numbers with letters (using _ to replace null), the r-weight entries
    are cabs_, abs_c, bs_ac, and s_bac, respectively.

The second point is what causes a slightly more meaningful ordering of the
entries with a slightly better and more interesting sub-clustering when sorted.
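The construction of the r-weights can be written down compactly. The following Python function is a sketch derived from the cabs listing above, not the module's source; `None` plays the role of the ‘null’ value.

```python
def r_weights(weights, null=None):
    """One rotated weight list per factor position:
    infix and suffix weights, then null, then the prefix weights reversed."""
    return [
        weights[i:] + [null] + weights[:i][::-1]
        for i in range(len(weights))
    ]

def as_letters(rotated):
    """Render a rotated weight list with letters, using _ for null."""
    return "".join("_" if w is None else chr(w) for w in rotated)

for rotated in r_weights([99, 97, 98, 115]):   # weights of 'cabs'
    print(rotated, as_letters(rotated))
```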

The remaining three entries in each permutation sub-list contain, in terms of
factors, the infix I, the suffix S and the (non-reversed) prefix P that
the r-weights represent in terms of weights. These data are made part of the
information that goes into the LevelDB key for the simple reason that a
reconstruction of these pieces from the rotated weights would be awkward, and
retrieval from separate keys or external data sources cumbersome.

In case you should be wondering: yes, this extra data, being put into a DB
key, does potentially affect the overall sorting of entries, but the Hollerith
CoDec being constructed the way it is, the factors can only ever have a
bearing on the sorting if two identical r-weight lists (same lengths, same
contents) happen to occur with differing factors. In that case, Unicode
lexicographic ordering comes into effect. With a properly implemented scheme
of factorizations and weightings, that should never happen.

Single Step

The exact same result as above may be obtained by a slightly simpler procedure
that hides the intermediate results:

entries     = unique_words_from_text text
collection  = []
for entry in entries
  collection.push [ ( KWIC.permute entry ), entry, ]
KWIC.report collection

Here, we call KWIC.permute on each entry. This method will most of the time be called
with an additional settings object such as:

my_factorizer = ( entry  ) -> entry.trim().split /\s+/    # the words of the entry
my_weighter   = ( factor ) -> factor.codePointAt 0        # stand-in; substitute a fancy localized weight

...

for entry in entries
  permutations = KWIC.permute entry, factorizer: my_factorizer, alphabet: my_weighter
  collection.push [ permutations, entry, ]

Relationship to Hollerith, the Binary Phrase DB

The keys generated by KWIC.permute may be used as keys in a Hollerith Phrase
DB, as they are Hollerith CoDec-compliant. In fact,
the KWIC demo encodes all generated keys to NodeJS buffers which are
then sorted using buffer.compare; thus, consistency with LevelDB’s
lexicographical sorting is ensured even without using a DB instance for the
purpose.

Related Software

There is a rather obscure UNIX utility by the name of ptx and an even more
obscure GNU version of the same, gptx.

July 03, 2019

Keyword in Context (KWIC) Indexing

The Keyword in Context (KWIC) indexing system is based on the principle that the title of a document represents its contents. The title is treated as a one-line abstract of the document, and the significant words in the title indicate its subject. A KWIC index makes an entry under each significant word in the title, along with the remaining part of the title to keep the context intact. The entries are derived by using the terms one by one as the lead term, along with the entire context for each entry.

(a) Structure

Each entry in KWIC index consists of three parts

i) Keyword: Significant words of the title which serve as approach/access terms.

ii) Context: The rest of the terms of the title, provided along with the keyword, specify the context of the document.

iii) Identification or Location Code: A code (usually the serial number of the entry) which provides the address of the document where its full bibliographical details will be available.

In order to indicate the end of the title a “/” symbol is used. The identification code is put on the extreme right to indicate the location of the document.

(b) Indexing Process

KWIC indexing system consists of three steps

Step I: Keyword selection

Step II: Entry generation

Step III: Filing

Step I: First of all, significant words or keywords are selected from the title. This is done by omitting articles, prepositions, conjunctions and other non-significant words or terms. The selection is done by the editor, who marks the keywords. When a computer is used for preparing an index, the selection is done by storing a ‘stop list’ of non-significant terms in it. A stop list consists of articles, prepositions and certain other common words which are stopped from becoming keywords. Another method of providing the correct terms for entries is human intervention at the input stage, wherein the editor indicates the key terms which are then picked up by the computer.

Step II: After the selection of keywords, the computer moves the title laterally in such a way that a significant word (keyword) for a particular entry always appears either on the extreme left-hand side or in the center. The same thing can be performed manually following the structure of KWIC to generate entries.

Step III: After all the index entries for a document are generated, each entry is filed at its appropriate place in the alphabetical sequence.

Example: Classification of Books in a University Library (with identification code 1279)

Step I :

Classification Books University Library

Step II :

CLASSIFICATION of Books in a University Library 1279

BOOKS in a University Library/Classification of 1279

UNIVERSITY Library/Classification of Books in a 1279

LIBRARY/Classification of Books in a University 1279

Step III :

BOOKS in a University Library/Classification of 1279

CLASSIFICATION of Books in a University Library 1279

LIBRARY/Classification of Books in a University 1279

UNIVERSITY Library/Classification of Books in a 1279

The keyword may also be in the centre as follows:

Classification of BOOKS in a University Library 1279

University Library/CLASSIFICATION of Books in a 1279

in a University LIBRARY/Classification of Books 1279

of Books in a UNIVERSITY Library/Classification 1279
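The three steps (keyword selection, entry generation, filing) can be sketched in Python for the left-aligned form of the worked example. The stop list and the function name are assumptions made for this illustration; a production stop list would be much longer.

```python
# Hypothetical stop list of non-significant terms (Step I).
STOP_LIST = {"a", "an", "the", "of", "in", "on", "and", "for"}

def kwic_entries(title, code):
    """Generate and file left-aligned KWIC entries for one title."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        if word.lower() in STOP_LIST:           # Step I: keyword selection
            continue
        # Step II: move the keyword to the front; "/" marks the end of the title.
        head = " ".join([word.upper()] + words[i + 1:])
        tail = " ".join(words[:i])
        entry = head + ("/" + tail if tail else "")
        entries.append(f"{entry} {code}")
    return sorted(entries, key=str.lower)        # Step III: alphabetical filing

for entry in kwic_entries("Classification of Books in a University Library", 1279):
    print(entry)
```

The output lists the BOOKS, CLASSIFICATION, LIBRARY and UNIVERSITY entries in that order, each followed by the identification code.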




SOURCE 

  • Information Access Through The Subject : An Annotated Bibliography (Chapter 3)  / by Salman Haider. — Online : OpenThesis, 2015. (408 pages ; 23 cm.)

Annotated bibliography titled Information Access Through The Subject covering Subject Indexing, Subject Cataloging, Classification, Artificial Intelligence, Expert Systems, and Subject Approaches in Bibliographic and Non-Bibliographic Databases etc. 

Contents

  • Lesson Goals
  • Files Needed For This Lesson
  • From Text to N-Grams to KWIC
  • From Text to N-grams
  • Code Syncing

Lesson Goals

As in Output Data as HTML File, this lesson takes the frequency
pairs collected in Counting Frequencies and outputs them in HTML.
This time the focus is on keywords in context (KWIC), which are built
from n-grams of the original document content; in this case, a trial
transcript from the Old Bailey Online. You can use your program to
select a keyword, and the computer will output all instances of that
keyword along with the words to the left and right of it, making it
easy to see at a glance how the keyword is used.

Once the KWICs have been created, they are then wrapped in HTML and sent
to the browser where they can be viewed. This reinforces what was
learned in Output Data as HTML File, opting for a slightly
different output.

At the end of this lesson, you will be able to extract all possible
n-grams from the text. In the next lesson, you will learn how to
output all of the n-grams of a given keyword in a document downloaded
from the Internet, and display them clearly in your browser window.

Files Needed For This Lesson

  • obo.py

If you do not have this file from the previous lesson, you can
download programming-historian-7, a zip file from the previous lesson.

From Text to N-Grams to KWIC

Now that you know how to harvest the textual content of a web page
automatically with Python, and have begun to use strings, lists and
dictionaries for text processing, there are many other things that you
can do with the text besides counting frequencies. People who study the
statistical properties of language have found that studying linear
sequences of linguistic units can tell us a lot about a text. These
linear sequences are known as bigrams (2 units), trigrams (3 units), or
more generally as n-grams.

You have probably seen n-grams many times before. They are commonly used
on search results pages to give you a preview of where your keyword
appears in a document and what the surrounding context of the keyword
is. This application of n-grams is known as keywords in context (often
abbreviated as KWIC). For example, if the string in question were “it
was the best of times it was the worst of times it was the age of wisdom
it was the age of foolishness” then a 7-gram for the keyword “wisdom”
would be:

the age of wisdom it was the
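
As a quick check, the 7-gram above can be reproduced in a Python shell. This is a minimal sketch that assumes the keyword occurs only once and has at least three words on each side; the variable names are chosen for this example.

```python
text = ('it was the best of times it was the worst of times '
        'it was the age of wisdom it was the age of foolishness')
words = text.split()

# Find the position of the keyword, then take 3 words before it,
# the keyword itself, and 3 words after it (7 words in total).
position = words.index('wisdom')
sevenGram = words[position - 3:position + 4]

print(' '.join(sevenGram))
# -> the age of wisdom it was the
```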

An n-gram could contain any type of linguistic unit you like. As a
historian, you are most likely to use characters, as in the bigram “qu”,
or words, as in the trigram “the dog barked”; however, you could also use
phonemes, syllables, or any number of other units depending on your
research question.
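
To make the character case concrete, here is a small sketch that lists the character bigrams of a single word. The word “queen” and the variable names are assumptions picked for this illustration.

```python
word = 'queen'

# Slide a window of 2 characters across the word, one position at a time.
characterBigrams = [word[i:i+2] for i in range(len(word) - 1)]

print(characterBigrams)
# -> ['qu', 'ue', 'ee', 'en']
```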

What we’re going to do next is develop the ability to display KWIC for
any keyword in a body of text, showing it in the context of a fixed
number of words on either side. As before, we will wrap the output so
that it can be viewed in Firefox and added easily to Zotero.

From Text to N-grams

Since we want to work with words as opposed to characters or phonemes,
it will be much easier to create n-grams using a list of words rather
than strings. As you already know, Python can easily turn a string into
a list using the split operation. Once split it becomes simple to
retrieve a subsequence of adjacent words in the list by using a slice,
represented as two indexes separated by a colon. This was introduced
when working with strings in Manipulating Strings in Python.

message9 = "Hello World"
message9a = message9[1:8]
print(message9a)
-> ello Wo

However, we can also use this technique to take a predetermined number
of neighbouring words from the list with very little effort. Study the
following examples, which you can try out in a Python Shell.

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()

print(wordlist[0:4])
-> ['it', 'was', 'the', 'best']

print(wordlist[0:6])
-> ['it', 'was', 'the', 'best', 'of', 'times']

print(wordlist[6:10])
-> ['it', 'was', 'the', 'worst']

print(wordlist[0:12])
-> ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']

print(wordlist[:12])
-> ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']

print(wordlist[12:])
-> ['it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']

In these examples we have used slice notation to return parts of our
list. Note that there are two sides to the colon in a slice. If the
right of the colon is left blank, as in the last example above, the
program knows to automatically continue to the end; in this case, to
the end of the list. The second-to-last example above shows that we can
start at the beginning by leaving the space before the colon empty. This
is a handy shortcut that keeps your code shorter.
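
A short experiment, sketched here with variable names chosen for the example, confirms how the two shortcuts fit together. It also shows that a slice reaching past the end of the list does not cause an error; Python simply returns whatever items exist.

```python
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()

# The two shortcut slices split the list into halves that rejoin exactly.
firstHalf = wordlist[:12]
secondHalf = wordlist[12:]
print(firstHalf + secondHalf == wordlist)
# -> True

# The list has 24 words, so this slice runs past the end and is cut short.
pastTheEnd = wordlist[22:30]
print(pastTheEnd)
# -> ['of', 'foolishness']
```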

You can also use variables to represent the index positions. Used in
conjunction with a for loop, you could easily create every possible
n-gram of your list. The following example prints all 5-grams of our
string from the example above. The loop stops four positions before the
end of the list so that every slice contains exactly five words.

for i in range(len(wordlist) - 4):
    print(wordlist[i:i+5])

Keeping with our modular approach, we will create a function and save it
to the obo.py module that can create n-grams for us. Study and type or
copy the following code:

# Given a list of words and a number n, return a list
# of n-grams.

def getNGrams(wordlist, n):
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]

This function may look a little confusing as there is a lot going on
here in not very much code. It uses a list comprehension to keep the
code compact. The following example does exactly the same thing:

def getNGrams(wordlist, n):
    ngrams = []
    for i in range(len(wordlist)-(n-1)):
        ngrams.append(wordlist[i:i+n])
    return ngrams

Use whichever makes most sense to you.

A concept that may still be confusing to you is the pair of function
arguments. Notice that our function has two variable names in the
parentheses after its name when we declared it: wordlist, n. These two
variables are the function arguments. When you call (run) this function,
these variables will be used by the function for its solution. Without
these arguments there is not enough information to do the calculations.
In this case, the two pieces of information are the list of words you
want to turn into n-grams (wordlist), and the number of words you want
in each n-gram (n). For the function to work it needs both, so you call
it like this (save the following as useGetNGrams.py and run):

#useGetNGrams.py

import obo

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
allMyWords = wordstring.split()

print(obo.getNGrams(allMyWords, 5))

Notice that the arguments you enter do not have to have the same names
as the arguments named in the function declaration. Python knows to use
allMyWords everywhere in the function that wordlist appears, since this
is given as the first argument. Likewise, all instances of n will be
replaced by the integer 5 in this case. Try changing the 5 to a string,
such as “elephants” and see what happens when you run your program. Note
that because n is being used as an integer, you have to ensure the
argument sent is also an integer. The same is true for strings, floats
or any other variable type sent as an argument.
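
If you would rather see the result of that experiment without crashing your program, the call can be wrapped in a try/except block. This is only a sketch: the function is repeated so the example stands alone, and the shortened word list and the variable outcome are assumptions made for illustration.

```python
def getNGrams(wordlist, n):
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]

allMyWords = 'it was the best of times'.split()

try:
    # n is used in arithmetic (n-1), so a string cannot stand in for it.
    getNGrams(allMyWords, 'elephants')
    outcome = 'no error'
except TypeError as error:
    outcome = 'TypeError'
    print('TypeError:', error)
```

Python raises a TypeError here because it cannot subtract 1 from a string.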

You can also use a Python shell to play around with the code to get a
better understanding of how it works. Paste the function declaration for
getNGrams (either of the two functions above) into your Python shell.

test1 = 'here are four words'
test2 = 'this test sentence has eight words in it'

getNGrams(test1.split(), 5)
-> []

getNGrams(test2.split(), 5)
-> [['this', 'test', 'sentence', 'has', 'eight'],
['test', 'sentence', 'has', 'eight', 'words'],
['sentence', 'has', 'eight', 'words', 'in'],
['has', 'eight', 'words', 'in', 'it']]

There are two concepts that we see in this example of which you need to
be aware. Firstly, because our function expects a list of words rather
than a string, we have to convert the strings into lists before our
function can handle them. We could have done this by adding another line
of code above the function call, but instead we used the split method
directly in the function argument as a bit of a shortcut.

Secondly, why did the first example return an empty list rather than the
n-grams we were after? In test1, we have tried to ask for an n-gram that
is longer than the number of words in our list. This has resulted in a
blank list. In test2 we have no such problem and get all possible
5-grams for the longer list of words. If you wanted to you could adapt
your function to print a warning message or to return the entire string
instead of an empty list.
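
One possible adaptation is sketched below. The name getNGramsOrAll is hypothetical, and returning the whole word list as a single, shorter n-gram is just one reasonable reading of the suggestion above, not the lesson's own code.

```python
def getNGramsOrAll(wordlist, n):
    # If the request is longer than the text, warn and return the whole
    # list of words as a single (shorter) n-gram instead of an empty list.
    if n > len(wordlist):
        print('Warning: only', len(wordlist), 'words available, but n is', n)
        return [wordlist]
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]

test1 = 'here are four words'
print(getNGramsOrAll(test1.split(), 5))
# -> [['here', 'are', 'four', 'words']]
```

For lists that are long enough, this version behaves exactly like getNGrams.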

We now have a way to extract all possible n-grams from a body of text.
In the next lesson, we can focus our attention on isolating those
n-grams that are of interest to us.
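
As a rough preview of what that isolation might look like, the sketch below keeps only the n-grams whose middle word is the keyword. The helper name ngramsAroundKeyword is hypothetical, and this is not the code used in the next lesson.

```python
def getNGrams(wordlist, n):
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]

# Hypothetical helper: keep the n-grams that have the keyword in the middle.
def ngramsAroundKeyword(wordlist, keyword, n):
    middle = n // 2
    return [gram for gram in getNGrams(wordlist, n) if gram[middle] == keyword]

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

print(ngramsAroundKeyword(wordstring.split(), 'age', 5))
# -> [['was', 'the', 'age', 'of', 'wisdom'], ['was', 'the', 'age', 'of', 'foolishness']]
```

Note that a keyword sitting within two words of the start or end of the text would be missed by this simple approach, because no full 5-gram has it in the middle.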

Code Syncing

To follow along with future lessons it is important that you have the
right files and programs in your “programming-historian” directory. At
the end of each chapter you can download the “programming-historian” zip
file to make sure you have the correct code. If you are following along
with the Mac / Linux version you may have to open the obo.py file and
change “file:///Users/username/Desktop/programming-historian/” to the
path to the directory on your own computer.

  • python-lessons8.py (zip sync)
