Background Knowledge Datasets

Background knowledge is the set of true facts that semantic tools use to draw their conclusions. For instance, it may state that a dog is an animal, or that Rome is a city and is part of Italy.

Recent evaluations of matching systems show that the lack of background knowledge, most often domain-specific knowledge, is currently one of their key problems. In fact, most state-of-the-art systems show low recall (<30%) on tasks that involve matching thousands of nodes, whereas on toy examples their recall was most often around 90%.

WordNet, although not specifically designed for this purpose, is the de facto source of background knowledge in many semantic applications. Unfortunately, its coverage of geographic information (and of domain-specific knowledge in general) is very limited. In addition, WordNet provides neither latitude and longitude coordinates nor other information that is of fundamental importance in geo-spatial applications.

To overcome these limitations we created GeoWordNet [1].

GeoWordNet

A geo-spatial ontology is an ontology consisting of geo-spatial classes (e.g., lake, city), entities (e.g., Lago di Molveno, Trento), their metadata (e.g., latitude and longitude coordinates) and the relations between them (e.g., part-of, instance-of). GeoWordNet is a multilingual geo-spatial ontology built from the full integration of WordNet, GeoNames and the Italian part of MultiWordNet.

The building process was largely automatic, with a small amount of human intervention in the critical parts. As a result, we obtained a satisfactory outcome in terms of both quality and quantity. The core of the process is the manual mapping between the 663 GeoNames classes and WordNet synsets. When a corresponding synset was found in WordNet (exact match), we reused it to represent the class; otherwise, we created a new synset for the class and connected it to its most suitable parent synset using an is-a or part-of relation. Using this mapping, all the entities were imported as instances of the corresponding synset/concept.
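The mapping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build GeoWordNet: the `Synset` type, the `map_class` function and the toy class names are hypothetical, and the parent of a new synset is supplied by hand to mirror the manual step.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """Hypothetical, minimal stand-in for a WordNet synset."""
    name: str
    # Outgoing relations as (relation name, target synset name) pairs.
    relations: list = field(default_factory=list)


def map_class(class_name, wordnet, manual_parent):
    """Return (synset, created) for one GeoNames class.

    Exact match: the existing synset is reused.
    No match: a new synset is created and linked to the parent chosen
    by a human, using an is-a or part-of relation.
    """
    if class_name in wordnet:
        return wordnet[class_name], False
    relation, parent = manual_parent[class_name]
    synset = Synset(class_name, [(relation, parent)])
    wordnet[class_name] = synset
    return synset, True


# Toy data for illustration only; not taken from the real mapping.
wordnet = {"lake": Synset("lake")}
manual_parent = {"salt lake": ("is-a", "lake")}

lake, lake_created = map_class("lake", wordnet, manual_parent)
salt_lake, salt_lake_created = map_class("salt lake", wordnet, manual_parent)
```

Here `lake` is reused because it already exists, while `salt lake` is created and attached to `lake` through an is-a relation.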

For distribution purposes, we created packages in three formats: relational, dict and RDF.

First, the relational format: a set of relational tables that store the dataset. They are as follows:

  • Concept (con_id, name, gloss, lang, provenance)
  • Relation (src_con_id, trg_con_id, name, lang)
  • Entity (entity_id, name, con_id, lang, latitude, longitude, provenance)
  • Alternative_name_ENG (entity_id, name)
  • Alternative_name_ITA (entity_id, name)
  • Part_of (src_entity_id, trg_entity_id)

 

The Concept table stores the concepts created from GeoNames classes and from WordNet and MultiWordNet synsets. Note that for each Italian synset in MultiWordNet there is a corresponding English synset in WordNet, and both represent the same concept. The Relation table stores the relations between concepts. The Entity table stores GeoNames locations and their attributes. We store the English name of each location in the Entity table, its English alternative names in the Alternative_name_ENG table and its Italian alternative names in the Alternative_name_ITA table. In the Part_of table, src_entity_id and trg_entity_id are the GeoNames IDs of the part (child) and the whole (parent) locations, respectively.
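To make the table layout concrete, the following sketch creates the six tables in an in-memory SQLite database and walks the part-of hierarchy upwards. Table and column names come from the list above; everything else is an assumption made for illustration: the column types, the primary keys, the `lang` and `provenance` values, the helper functions `ancestors` and `entities_of_concept`, and the toy rows, whose IDs are placeholders rather than real GeoNames IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types and keys are assumptions; only the names come from the schema.
conn.executescript("""
CREATE TABLE Concept (con_id INTEGER PRIMARY KEY, name TEXT, gloss TEXT,
                      lang TEXT, provenance TEXT);
CREATE TABLE Relation (src_con_id INTEGER, trg_con_id INTEGER,
                       name TEXT, lang TEXT);
CREATE TABLE Entity (entity_id INTEGER PRIMARY KEY, name TEXT, con_id INTEGER,
                     lang TEXT, latitude REAL, longitude REAL, provenance TEXT);
CREATE TABLE Alternative_name_ENG (entity_id INTEGER, name TEXT);
CREATE TABLE Alternative_name_ITA (entity_id INTEGER, name TEXT);
CREATE TABLE Part_of (src_entity_id INTEGER, trg_entity_id INTEGER);
""")

# Toy rows: placeholder IDs, rough coordinates, hypothetical lang/provenance.
conn.executemany("INSERT INTO Concept VALUES (?, ?, ?, ?, ?)", [
    (1, "continent", None, "ENG", "WordNet"),
    (2, "country", None, "ENG", "WordNet"),
    (3, "city", None, "ENG", "WordNet"),
])
conn.executemany("INSERT INTO Entity VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (10, "Europe", 1, "ENG", None, None, "GeoNames"),
    (20, "Italy", 2, "ENG", None, None, "GeoNames"),
    (30, "Rome", 3, "ENG", 41.9, 12.5, "GeoNames"),
])
conn.executemany("INSERT INTO Alternative_name_ITA VALUES (?, ?)",
                 [(20, "Italia"), (30, "Roma")])
# src_entity_id is the part (child), trg_entity_id is the whole (parent).
conn.executemany("INSERT INTO Part_of VALUES (?, ?)", [(30, 20), (20, 10)])


def ancestors(conn, entity_id):
    """Names of all wholes that contain the entity, nearest first."""
    rows = conn.execute("""
        WITH RECURSIVE up(entity_id, depth) AS (
            SELECT trg_entity_id, 1 FROM Part_of WHERE src_entity_id = ?
            UNION ALL
            SELECT p.trg_entity_id, up.depth + 1
            FROM Part_of AS p JOIN up ON p.src_entity_id = up.entity_id
        )
        SELECT e.name FROM up JOIN Entity AS e ON e.entity_id = up.entity_id
        ORDER BY up.depth
    """, (entity_id,)).fetchall()
    return [name for (name,) in rows]


def entities_of_concept(conn, concept_name):
    """Names of the entities that are instances of the given concept."""
    rows = conn.execute("""
        SELECT e.name FROM Entity AS e
        JOIN Concept AS c ON c.con_id = e.con_id
        WHERE c.name = ? ORDER BY e.name
    """, (concept_name,)).fetchall()
    return [name for (name,) in rows]


print(ancestors(conn, 30))                 # ['Italy', 'Europe']
print(entities_of_concept(conn, "city"))   # ['Rome']
```

The recursive query follows Part_of from child to parent, so a location is reported together with every whole that contains it. On the full dataset, indexes on the Part_of columns would likely be needed for this to perform well.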

Figure 1 shows the database schema used to store GeoWordNet. The schema is drawn using crow's foot notation.

Figure 1: GeoWordNet database schema.

Second, the dict format. GeoWordNet in dict format is a rendering of the union of Princeton WordNet 3.0 and GeoWordNet in the Princeton WordNet dict format. It is available in two packages: full and compatible. The full package contains all GeoWordNet content, at the cost of being compatible with only a few libraries, while the compatible package contains a portion of GeoWordNet content small enough to satisfy the limitations of the format. The compatible version can be loaded by the original WordNet binary and by many other libraries.

Third, the RDF format. GeoWordNet in RDF format is a rendering of GeoWordNet in RDF using a scheme developed for WordNet 3.0. For more information about this earlier work, read about WordNet 3.0 in RDF. The RDF package is available for download and can be loaded as a unified model together with WordNet 3.0 in RDF, using an RDF library such as Jena. In addition, the dataset is hosted at http://geowordnet.semanticmatching.org, and its URLs are served as RDF+XML content for Semantic Web agents and as HTML for humans.
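A client could ask for one representation or the other through the HTTP Accept header. The sketch below only builds the requests and does not contact the server. It assumes that the site selects the representation through standard content negotiation, which the description above suggests but does not state; the helper names `rdf_request` and `html_request` are hypothetical.

```python
from urllib.request import Request

BASE_URL = "http://geowordnet.semanticmatching.org"


def rdf_request(url):
    """Request that asks for the RDF+XML representation of a URL."""
    return Request(url, headers={"Accept": "application/rdf+xml"})


def html_request(url):
    """Request that asks for the human-readable HTML representation."""
    return Request(url, headers={"Accept": "text/html"})


agent_request = rdf_request(BASE_URL)
browser_request = html_request(BASE_URL)

# To actually fetch the content, pass a request to urllib.request.urlopen:
#   from urllib.request import urlopen
#   with urlopen(agent_request) as response:
#       rdf_xml = response.read()
```

The returned RDF+XML could then be handed to an RDF library for parsing.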

The GeoWordNet dataset is made publicly available as follows:

  • We provide the top 4 of the 7 levels of the part-of hierarchy.
  • We provide 50% of the GeoNames classes with an exact match in WordNet and 50% of the new ones. The number of classes taken per feature class is shown in Table 1.
  • We make available 70% of the entities per class.


Table 1: Statistics of the classes per feature class.

Feature class            Classes found in WordNet   Newly created classes
Country, state, region   6                          6
Stream, lake             37                         67
Parks, area              18                         12
City, village            2                          7
Road, railroad           5                          10
Spot, building, farm     80                         14
Mountain, hill, rock     3                          35
Undersea                 1                          26
Forest, heath            2                          3
Total Number of Classes  154                        180

 

The GeoWordNet Public Dataset contains 3,698,238 entities, 3,698,237 part-of relations between entities, 334 concepts, 182 relations between concepts, 3,698,238 relations between instances and concepts, and 13,562 (English and Italian) alternative entity names.

The GeoWordNet Public Dataset is distributed under the terms and conditions of the Creative Commons Attribution 3.0 Unported License. Please acknowledge its use in your scientific work by citing: Fausto Giunchiglia, Vincenzo Maltese, Feroz Farazi and Biswanath Dutta. GeoWordNet: A Resource for Geo-spatial Applications. In Proc. of 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, pp. 121-136.

A small sample is available for testing purposes, while the GeoWordNet Public Dataset itself can be freely downloaded from the downloads section. The installation documentation can be found in the documentation section.