Log Analysis and Geographic Query Identification (LAGI)
Task Guidelines and Examples (version 2.1)

Alexander Yeh, Inderjeet Mani, Christine Doran
July 31, 2009
Contact email: asy at mitre.org (' at ' -> '@')

Copyright 2009 The MITRE Corporation. All rights reserved.
Approved for Public Release; Distribution Unlimited. Case# 09-3188

The task is to identify geographic elements in search log queries.

This is being done for two sets of logs:

  1. Tumba! - a Portuguese web search engine
  2. The European Library (TEL) - an online search service for materials in various libraries in Europe. We are looking at the subset of queries in English (which make up the majority of the logs).


2009 Schedule

Time zone for times: Eastern United States Daylight Saving Time


With this version of the guidelines and examples,
small sample data files are available, both before and after annotation.
Unfortunately, there is no time for producing larger training data sets.


July 31: Test data will be available at http://app01.iw.uni-hildesheim.de/~clef/LogCLEF/


August 7:
The annotated test data files are due back to Alexander Yeh by midnight.
They can be sent by email to: asy at mitre.org (' at ' -> '@').
The test data will have a UTF-8 character encoding.
To try to preserve this encoding and the line endings used in the data files,
please send the files back as attachments
that have been compressed with gzip
(some compression schemes may change the line endings used in the text).
Please send the Tumba! and TEL data files back as separate files.
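
The packaging step above can be sketched in Python as follows. This is only
a minimal illustration: the function name and file names are hypothetical.
The file is copied in binary mode, so the UTF-8 bytes and the original line
endings pass through unchanged.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Compress src_path with gzip, copying its bytes unchanged.

    Binary mode ('rb'/'wb') avoids any re-encoding or newline translation.
    """
    if dst_path is None:
        dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Calling gzip_file once for the Tumba! file and once for the TEL file yields
two separate attachments.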


August 14:
The results will be sent to the participants.
Recall, precision and balanced F scores will be calculated
for finding places in the data sets.
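
For reference, a balanced F score is the harmonic mean of precision and
recall. The sketch below computes the three scores from two sets of place
tags. Representing a tag as a (query id, start, end) tuple and requiring an
exact match are assumptions made for illustration only; they are not the
official scoring method.

```python
def precision_recall_f(system_tags, answer_tags):
    """Return (precision, recall, balanced F) for two sets of place tags.

    Each tag is assumed to be a hashable item such as (query_id, start, end).
    A system tag counts as correct only if it exactly matches an answer tag.
    """
    system_tags, answer_tags = set(system_tags), set(answer_tags)
    correct = len(system_tags & answer_tags)
    precision = correct / len(system_tags) if system_tags else 0.0
    recall = correct / len(answer_tags) if answer_tags else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```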

Note that for various reasons,
it is possible that we will not score some parts of the test data.
If we do this, we will let you know what parts were not scored
when we send out the results.

Rules:

  1. The TEL queries have a large overlap with the dataset in Log Analysis for Digital Societies (LADS), the other track in LogCLEF 2009. You are NOT allowed to use the LADS dataset to help you with the LAGI task.
  2. A query is a geographical query if and only if it is bounded geographically. Thus, the queries "restaurant" or "restaurant food" are not considered geographical queries.
  3. The task is to mark the query with zero or more non-overlapping place tags indicating geographical elements. Each place tag marks a substring (proper or not) of the query, which we will call a place term. Queries that are classified as not being a geographical query are not marked with any place tags.
  4. A place term can be any country, city or town, geographical feature, building, stadium, university, store, restaurant, statue, etc. that is described as a place in Wikipedia (Portuguese Wikipedia for Tumba! and English Wikipedia for our English subset of TEL). Wikipedia was chosen because it is reasonably comprehensive and is readily available. Since Wikipedia is constantly being updated, the evaluation will use the particular versions that are mentioned later on. While those versions will be used to generate the official answers, participants are free to use other Wikipedia resources as well.
  5. A candidate place term can map to more than one possible meaning in Wikipedia.
    If the look-up of a candidate place term returns an initial Wikipedia disambiguation page linking directly to at least one disambiguated page describing a place, the candidate place term is tagged as a place.
    If the initial page from Wikipedia is not a disambiguation page, but contains a link to a disambiguation page, the candidate place term is marked as a place only if the initial page describes a place. The initial page is taken to represent the predominant sense of the candidate term.
    To do a look-up, type the candidate place term into the 'search' (English)/'busca' (Portuguese) area and then either press the 'Enter' key or click 'Go'/'Ir' (do not click 'Search'/'Pesquisa').

    This method of disambiguation is used when a query has no indicated preference for which sense to use. But if the query indicates a preference for a sense, then that sense is what is used. An example is the query 'casanova commune'.
    A search for 'casanova commune' in the English Wikipedia does not return an article. Rather it returns a 'search' page (instead of being an article for some term, the page gives a ranked list of articles that contain parts of the candidate place term somewhere in the articles' text), which is ignored in this evaluation.
    For 'casanova', the English Wikipedia returns an article on a person named Casanova, so that is the default predominant sense. But that article has a link to a disambiguation page, and the disambiguation page has a link to the place 'Casanova, Haute-Corse', which is a commune.
    This query indicates that this sense of 'casanova' is the preferred one for the query, so this overrides the default preferred sense based on the initial page returned by the Wikipedia.
  6. A place term can occur in a title (of a book, movie, team, etc.), but the title itself (if a different text span from the place) is not to be tagged.
    If the Wikipedia being used mentions that the phrase being examined is the title (of a book, movie, etc.), this Wikipedia entry is ignored for the purposes of this evaluation. For example, if TEL had the query 'paris match', a look-up in the English Wikipedia will return an article about a weekly magazine. For this evaluation, ignore the fact that 'paris match' is a weekly magazine.
    The reason for this rule is to make query processing more interesting for this evaluation given that the TEL queries are full of titles of books, etc.
  7. Capitalization (upper and lower case) in the query is ignored, as it is used inconsistently in the queries. Thus queries are treated as folded to lowercase.
  8. Acronyms are treated like other potential place terms. Example acronyms include 'USA' for 'United States of America' and 'FCUL' for 'Faculdade de Ciências da Universidade de Lisboa'. Acronyms are searched for in the Wikipedia as is, and are NOT expanded first.
    So for 'FCUL', one searches for 'FCUL' and not 'Faculdade de Ciências da Universidade de Lisboa'.
  9. There are no embedded place tags in this evaluation. Only the largest extent of a place term is marked.
    So with 'South America', mark 'South America' as a place, and not 'America'. A place term can include qualifications that provide additional context. Thus, in 'cavan county ireland', only 'cavan county ireland' is marked.
  10. Wildcards ('*') are ignored.
    So 'iceland*' is treated as if it were 'iceland', for which the English Wikipedia returns an article on the country 'Iceland'. 'ice*' is treated as if it were 'ice' (even though '*' will match 'land' and so 'ice*' will match 'iceland'), for which the English Wikipedia returns an article about frozen water.
  11. If some words of a query can be interpreted as forming a phrase, this will be preferred over interpreting those words as isolated words put in the same query.
    For example, with the query 'burlington university', given that the English Wikipedia does not have an article on a place with that name, the preference is to interpret this as a 2 word phrase rather than just the isolated words 'university' and 'burlington'.
    But with 'university burlington', for which the English Wikipedia also does not have an article on a place with that name, one cannot interpret 'university burlington' as a phrase, so just interpret 'university burlington' as 2 isolated words. Also see Rule 12 below.
  12. When some part of a query forms a phrase, the Wikipedia may NOT give the regular 'non-name' meaning of the words in the phrase that are not nouns, and may instead list some possible name meanings for those words.
    In this case, one should ignore the Wikipedia and use the 'regular non-name' meanings for these words if that makes sense.

    Some examples (more details given later on):

    'TEL: restaurant near university' - interpret 'near' as meaning 'close to'.
    This meaning is not in the English Wikipedia article on 'near'.
    That article does mention that 'near' may refer to the place 'near east'.
    This Wikipedia interpretation of 'near' will be ignored.

    'TEL: strongholds in xv century' - interpret 'in' as a preposition.
    This meaning is not in the English Wikipedia article on 'in'
    (Not in the version being used. The current English Wikipedia does mention this meaning).
    That article does mention 'in' as possibly referring to the
    following places:
    • India (country code),
    • Indiana in the US (postal abbreviation),
    • Ingolstadt in Germany (vehicle registration code).
    These place interpretations will be ignored.

    'Tumba!: jornais de leiria' [in English: 'periodicals of leiria']
    - interpret 'de' as a preposition ('of').
    This meaning is not in the Portuguese Wikipedia article on 'de'.
    That article does mention that 'de' may refer to the following places:
    • Germany (Deutschland),
    • Delaware.
    These place interpretations will be ignored.
  13. Places that have never existed (imaginary places) are not marked. An example of such a place: 'Wonderland' in the story "Alice's Adventures in Wonderland".
  14. Redirection: when there are many ways to refer to a topic, a Wikipedia will often put an article on the topic just under one way to refer to it. The other ways will each just have a short 'article' that is a redirection page: a page that just points to the one way of reference that has the actual article. Usually, a Wikipedia automatically follows the links in such redirection pages and you do not have to do anything.

    For example, in English, a look-up of 'sicilia' (using the 2008 dumps mentioned below):
    http://app01.iw.uni-hildesheim.de/wikiclef/wiki-en/index.php5/Sicilia
    returns an article on 'Sicily' with the note '(Redirected from Sicilia)'.

    Another example is in Portuguese, a look-up of 'marinha grande':
    http://app01.iw.uni-hildesheim.de/wikiclef/wiki-pt/index.php5/Marinha_grande
    returns an article on 'Marinha Grande' with the note '(Redirecionado de Marinha grande)'.

    If the Wikipedia does not automatically follow such links, then you should follow these links yourself and act as if the Wikipedia did automatically follow these links.

    An example is in the disambiguation article on 'parisian':
    http://app01.iw.uni-hildesheim.de/wikiclef/wiki-en/index.php5/Parisian_(disambiguation)
    the link to 'Parisian (person)' is the page:
    http://app01.iw.uni-hildesheim.de/wikiclef/wiki-en/index.php5/Parisian_(person)
    This page is a 'Redirect page' to 'Paris' (following the link leads to an article on 'Paris'). In this case, act as if the page on 'Parisian_(person)' is the article on 'Paris'.
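
The capitalization, wildcard and largest-extent rules above can be sketched
as two small helpers. The function names are hypothetical. Collapsing
repeated spaces and representing a place term as a (start, end) character
span are assumptions made for illustration.

```python
def normalize_candidate(term):
    """Fold a candidate place term to lowercase and drop '*' wildcards."""
    cleaned = term.replace("*", "").lower()
    # Collapse runs of whitespace (an assumption, not stated in the rules).
    return " ".join(cleaned.split())


def drop_embedded(spans):
    """Keep only the spans that are not contained in another, different span."""
    unique = set(spans)
    return sorted(
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    )
```

For the query 'south america', the candidate spans (0, 13) for
'south america' and (6, 13) for 'america' reduce to just (0, 13).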

Wikipedia:

As mentioned in the rules above, this task makes use of the Wikipedia.
The task will make use of read-only restored versions of the
Wikipedia from 2008 dumps,
25th of June for Portuguese:
http://app01.iw.uni-hildesheim.de/wikiclef/wiki-pt/index.php5/P%C3%A1gina_principal

and 24th of May for English:
http://app01.iw.uni-hildesheim.de/wikiclef/wiki-en/index.php5/Main_Page

The Portuguese Wikipedia is quite large and the English Wikipedia is even larger.
It is possible that it will be somewhat slow when multiple groups
request look-ups at roughly the same time.
Under heavy load conditions, we have sometimes observed some Wikipedia servers
either returning an empty web page or timing out while trying to respond.
If you will look up the same term multiple times, it may help
you to store the results of your first look-up of a term so that you
can refer to those results locally when you want to look up the same
term again later on.
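
One possible way to store look-up results locally is sketched below. The
class name, the JSON file layout and the caller-supplied fetch function are
all hypothetical. How a page is actually retrieved is left to the
participant.

```python
import json
import os


class LookupCache:
    """Store look-up results on disk so each term is fetched only once."""

    def __init__(self, path, fetch):
        self.path = path      # JSON file holding earlier results
        self.fetch = fetch    # caller-supplied function: term -> page text
        self.results = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.results = json.load(f)

    def lookup(self, term):
        """Return the stored result for term, fetching it on first use."""
        if term not in self.results:
            self.results[term] = self.fetch(term)
            # Rewrite the whole file after each new term (simple, not fast).
            with open(self.path, "w", encoding="utf-8") as f:
                json.dump(self.results, f, ensure_ascii=False)
        return self.results[term]
```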
Also, for faster responses, you may wish to use another version of the
Wikipedias on other web-sites (including the live Wikipedias
en.wikipedia.org and pt.wikipedia.org).
Be advised that these other versions will give responses that may be
different from the 2008 versions mentioned above,
and we used the above 2008 versions to generate the answer key.
In addition, due to differing amounts of network congestion and other factors,


Those pages for the 2008 dumps also give links to the dumps used in the
restoration for those interested
(the dump for English comes in 2 parts, which need to be concatenated together
using something like 'cat' in Unix).
As a warning, re-populating a Wikipedia with these dumps is hard.
Because of the difficulty in restoring the Wikipedias,
the version of the English Wikipedia being used in this evaluation
has two restrictions on it:
A. It has not been indexed for searching in the text of the articles
(matching a term to an article title still occurs).
This will not matter for the evaluation because, as mentioned in rule 5,
the results found in a 'search' page are ignored in this evaluation.

B. It does not contain articles with titles that have 'special' characters.
For example, the articles on "André Félibien", "Ångström"
and "Crécy-en-Ponthieu" are missing.
Because of this, when dealing with English, the evaluation will not look at data
containing special characters.

Note that these two restrictions are for the version of the English Wikipedia being used.
The version of the Portuguese Wikipedia being used does not have these restrictions
(although with rule 5, the search results returned by the Portuguese Wikipedia are
ignored).

An alternative url to reach both Wikipedias: http://www.uni-hildesheim.de/logclef/


The following may be helpful.
The files ptwiki-20080625pages-articles-titles.txt.gz
and enwiki-20080524pages-articles-titles.txt.gz
are available at the web-site http://app01.iw.uni-hildesheim.de/~clef/LogCLEF/
These files contain the lines with the "<title>...</title>" fields from the 2008 dumps.
These fields are the titles of the articles, so these files contain the titles of all the articles/pages in the dumps.
This may be useful to quickly determine whether there is an article/page with a particular name.
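
A minimal sketch of such a title check follows. It assumes the files are
gzip-compressed UTF-8 text with one "<title>...</title>" field per line and
XML-escaped titles; the function name is hypothetical. Titles are folded to
lowercase here, in line with the capitalization rule above.

```python
import gzip
import html
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def load_titles(path):
    """Return the set of lowercased article titles found in a titles file."""
    titles = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            match = TITLE_RE.search(line)
            if match:
                # Dump titles are XML text, so unescape entities such as &amp;
                titles.add(html.unescape(match.group(1)).lower())
    return titles
```

A title being present only shows that a page with that name exists. It does
not show whether the page describes a place, is a redirection page, or is a
disambiguation page.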

Examples:

Note: the format shown in the examples below is a preliminary format.
The exact format for the results and scoring method will be posted later.
---------

Tumba! Query Format

sample data files:
before annotation
after annotation

Each line in a data file contains the words for one query
(in the same word order as in the query).

There may be some '+'s and/or '"' (double quote marks).
Treat a '+' like a space.
Ignore a '"'.

In front of each query are two numbers and two '@' characters in the
sequence '[number] @ [number] @'
Just ignore (neither alter nor annotate) this preliminary character sequence.

Examples include:
4333825 @ 4777 @ "administração escolar"
4933229 @ 7888 @ "escola+hip+hop"
6971716 @ 106342 @ jornais de leiria

which give, respectively, the queries:
"administração escolar"
"escola+hip+hop"
jornais de leiria
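
The Tumba! line format described above can be read with a short helper like
the one below. This is a sketch only: the function name is hypothetical, and
it assumes every line carries the '[number] @ [number] @' prefix.

```python
def parse_tumba_line(line):
    """Split a Tumba! line into (prefix, query words).

    The '[number] @ [number] @' prefix is returned untouched. In the query,
    '+' is treated as a space and '"' is ignored.
    """
    first, second, query = line.rstrip("\r\n").split("@", 2)
    prefix = first + "@" + second + "@"
    words = query.replace("+", " ").replace('"', "").split()
    return prefix, words
```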

European Library Query Format

sample data files:
before annotation
after annotation
  1. A query string may appear enclosed inside a set of "'s.
    An example, "university of burlington"
  2. In that string, a space may be replaced by a '+'.
    An example: "university+of+burlington"
  3. A string may be placed inside a set of parentheses.
    One example: ("university of burlington")
    Another example: ("university+of+burlington")
  4. Inside the parentheses, just before the string, there may be one or two
    words to indicate the field in the catalog and the type of search to do
    with the string inside the parentheses.
    Two examples: (title all "university of burlington")
    (title all "university+of+burlington")
    Which indicates to look for a title containing the words in the string.
    The first word before the string can be one of the following:
    • title
    • creator
    • subject
    • type
    • language
    • isbn
    • issn
    • publisher
    The second word is 'all' (use all the words in the search)
    or 'exact' (only match the exact phrase in the search).
    Even with 'all', for this evaluation, rule 11 still applies:
    interpret the words in the string as a phrase when possible,
    as opposed to the words in the string being isolated words.

    Please ignore the text for specifying the 'language' field.
    For example,
    in '(language all "eng")',
    ignore "eng", which stands for English
    (the English Wikipedia mentions 'English', 'England' and
    some other possible meanings for "eng").
  5. These sets of parentheses may be combined with the boolean operators
    'and', 'or' and 'not'.
    In this evaluation, the only boolean operator that will be present
    is 'and',
    which indicates that the most desirable search item is one in which
    all the groups of words in the strings are present.
    Rule 11 does not apply to words from different strings being combined
    together with 'and'. So for example, in the query

    ("burlington") and ("university")

    treat "burlington" and "university" as isolated words and not the
    phrase 'burlington university',
    even though rule 11 would treat the string "burlington university"
    as a 2 word phrase.
  6. A variation of format 1: no enclosing "'s.
    An example, university of burlington
  7. A variation of format 4: no enclosing ( or ).
    An example: title all "university of burlington"
Note that the actual TEL queries may include some that are combinations
of the above formats.
For example: burlington and ("university")
These will not be included in this evaluation.

Each query is on one line.
Similar to Tumba! (but with a '&' instead of a '@'),
in front of each query are two numbers
and two '&' characters in the sequence '[number] & [number] &'
Just ignore (neither alter nor annotate) this preliminary character sequence.

Examples include:
902980 & 482 & (creator all "casanova")
906474 & 15432 & casanova
712725 & 5409 & ("cavan county ireland 1870")

which give, respectively, the queries:
(creator all "casanova")
casanova
("cavan county ireland 1870")
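
As a rough sketch, the TEL formats listed above could be read as follows.
The function name and the (field, mode, words) layout are hypothetical, and
the regular expression is a simplification that has only been checked
against the example queries in this document. Each double-quoted string
becomes its own group, so strings joined by 'and' stay separate. A query
with no double quotes is treated as a single string.

```python
import re

# Optional field word, optional 'all'/'exact', then a double-quoted string.
GROUP_RE = re.compile(r'(?:\b([a-z]+)\s+)?(?:\b(all|exact)\s+)?"([^"]*)"',
                      re.IGNORECASE)


def parse_tel_line(line):
    """Return (prefix, groups); each group is (field, mode, words)."""
    first, second, query = line.rstrip("\r\n").split("&", 2)
    prefix = first + "&" + second + "&"
    matches = GROUP_RE.findall(query)
    if not matches:
        # No double quotes at all: the whole query is one string.
        matches = [("", "", query)]
    groups = []
    for field, mode, text in matches:
        # A '+' inside a string stands for a space.
        words = text.replace("+", " ").split()
        groups.append((field.lower(), mode.lower(), words))
    return prefix, groups
```

A caller would skip any group whose field is 'language', since that text is
to be ignored.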