Thursday, May 13, 2010

Semantic Web Introduction - Part 2: RDF

This post is the second in a series introducing some of the concepts and technologies used as part of the Semantic Web. The first post gave a brief overview of the motivation behind the Semantic Web. This post discusses RDF, one of the core technologies used to describe entities.

The Resource Description Framework (RDF) is one of the fundamental Semantic Web standards: it specifies how data, in the form of statements, is structured. A statement is made up of three parts: a subject, a predicate, and an object. For example, a statement about the origins of penicillin could be written in the following form.

AlexanderFleming discovered Penicillin.

In this example, AlexanderFleming is the subject, discovered is the predicate, and Penicillin is the object. Statements in this form are referred to as triples, and a software component that stores them and provides access to them is called a triple store. A set of triples can be represented as a graph. For example, if we also want to state that Penicillin is used to treat Staphylococcus, we might write the statements as the following triples.

AlexanderFleming discovered Penicillin.
Penicillin treats Staphylococcus.
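
To make the idea of a triple store a little more concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions: the class name TripleStore and its add and match methods are made up for this post, terms are plain strings rather than URIs, and a real triple store would add indexing, persistence, and a query language.

```python
from typing import Iterator, Optional, Set, Tuple

# A triple is simply (subject, predicate, object).
Triple = Tuple[str, str, str]


class TripleStore:
    """A toy in-memory triple store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Store one statement; duplicates are ignored because a set is used."""
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield stored triples that fit the pattern; None acts as a wildcard."""
        for s, p, o in sorted(self._triples):
            if subject is not None and s != subject:
                continue
            if predicate is not None and p != predicate:
                continue
            if obj is not None and o != obj:
                continue
            yield (s, p, o)


store = TripleStore()
store.add("AlexanderFleming", "discovered", "Penicillin")
store.add("Penicillin", "treats", "Staphylococcus")

# Everything stated with Penicillin as the subject.
about_penicillin = list(store.match(subject="Penicillin"))

# Everything that points at Penicillin as the object.
pointing_at_penicillin = list(store.match(obj="Penicillin"))
```

Leaving any of the three positions open is what lets the same method answer "what did Fleming discover?" and "who discovered Penicillin?".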


A graph representation of the above statements could be shown as in Figure 1.

Figure 1 – Basic RDF Graph

In a graph representation, the subjects and objects from the triples are depicted as nodes and the predicates form the edges. In RDF, there are two types of nodes: resources and literals. A resource is an entity that is represented by the node and is identified by a Uniform Resource Identifier (URI), while a literal is a constant value. While resources can be the subject or the object in a triple, literals can only serve as the object. An additional feature of RDF is that the predicates are themselves resources. It is common in RDF to refer to predicates as properties. When a property relates two resources it is referred to as an object property, and when it relates a resource to a literal it is referred to as a data property.
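
The rules in the paragraph above can be sketched as a small data model. This is a hedged illustration rather than how any particular RDF library works: the Resource, Literal, and Triple classes and the property_kind helper are invented for this post, and http://example.org/ is a placeholder namespace.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """A node (or predicate) identified by a URI."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A constant value such as a string."""
    value: str


Node = Union[Resource, Literal]


@dataclass(frozen=True)
class Triple:
    subject: Resource
    predicate: Resource
    obj: Node

    def __post_init__(self) -> None:
        # Literals may only appear in the object position.
        if not isinstance(self.subject, Resource):
            raise TypeError("the subject of a triple must be a resource")
        if not isinstance(self.predicate, Resource):
            raise TypeError("the predicate of a triple must be a resource")


def property_kind(triple: Triple) -> str:
    """Classify the predicate by the kind of object it points to."""
    if isinstance(triple.obj, Literal):
        return "data property"
    return "object property"


EX = "http://example.org/"  # placeholder namespace, not from the original post

fleming = Resource(EX + "AlexanderFleming")
penicillin = Resource(EX + "Penicillin")

discovered = Triple(fleming, Resource(EX + "discovered"), penicillin)
born_in = Triple(fleming, Resource(EX + "bornIn"), Literal("Scotland"))

kinds = [property_kind(discovered), property_kind(born_in)]
```

Here discovered links two resources, so it is treated as an object property, while bornIn points at a literal and is treated as a data property. Trying to build a triple with a Literal as its subject raises a TypeError.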

Similar to other data formats like XML, RDF supports the concept of namespaces. This is an important aspect of the specification, as it allows the same name to be used in different contexts or vocabularies without conflict or ambiguity. Namespaces are identified by URIs and prefix the resource names that use them. Because full URIs are long, most RDF serializations support the definition of abbreviated prefixes, typically separated from the resource name by a colon, to make the data more compact and readable.
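
As a rough sketch of how prefixes work, the snippet below expands an abbreviated name into a full URI. The rdf namespace URI is the real one; the ex and hist namespaces and the expand function are assumptions made up for this illustration.

```python
from typing import Dict

PREFIXES: Dict[str, str] = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/medicine#",    # hypothetical vocabulary
    "hist": "http://example.org/history#",   # hypothetical vocabulary
}


def expand(prefixed_name: str, prefixes: Dict[str, str] = PREFIXES) -> str:
    """Turn 'prefix:localName' into a full URI using the prefix table."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator:
        raise ValueError(f"{prefixed_name!r} has no prefix")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix {prefix!r}")
    return prefixes[prefix] + local_name


# The same local name in two vocabularies yields two distinct URIs.
medical_penicillin = expand("ex:Penicillin")
historical_penicillin = expand("hist:Penicillin")
```

The two results share the local name Penicillin but are different URIs, which is how namespaces avoid ambiguity between vocabularies.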

The RDF specification defines its own namespace, which includes a few resources that can be used to help define data. One of the most commonly used is rdf:type, which allows a resource to be included in a class or category of resources. In a triple that uses rdf:type as the property, the subject is declared to be an instance of the class represented by the object. A subject is not limited to a single class; it can be declared an instance of several. While RDF allows for declaring that a resource is an instance of a class, it does not support the modeling of the classes themselves. That is the topic of the next two Semantic Web technologies, RDFS and OWL, which are discussed in future posts.
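
The following sketch shows rdf:type used to place one resource in more than one class. The rdf:type URI is the standard one; the http://example.org/ namespace, the class names, and the two helper functions are assumptions for illustration only.

```python
from typing import List, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # placeholder namespace

Triple = Tuple[str, str, str]

triples: List[Triple] = [
    # One subject declared as an instance of two classes.
    (EX + "AlexanderFleming", RDF_TYPE, EX + "Biologist"),
    (EX + "AlexanderFleming", RDF_TYPE, EX + "Pharmacologist"),
    (EX + "Penicillin", RDF_TYPE, EX + "Antibiotic"),
    # An ordinary statement that is not a type declaration.
    (EX + "AlexanderFleming", EX + "discovered", EX + "Penicillin"),
]


def classes_of(resource: str, data: List[Triple]) -> List[str]:
    """Return every class the resource is declared an instance of."""
    return sorted(o for s, p, o in data if s == resource and p == RDF_TYPE)


def instances_of(cls: str, data: List[Triple]) -> List[str]:
    """Return every resource declared an instance of the class."""
    return sorted(s for s, p, o in data if p == RDF_TYPE and o == cls)


fleming_classes = classes_of(EX + "AlexanderFleming", triples)
antibiotics = instances_of(EX + "Antibiotic", triples)
```

Note that nothing here says what a Biologist or an Antibiotic is, or how the classes relate to each other; plain RDF only records membership.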

Data modeled in RDF can be serialized in a number of different formats including RDF/XML, N3 (Notation3), N-Triples, and Turtle (Terse RDF Triple Language). The individual serializations have different benefits and drawbacks. For example, RDF/XML is supported by many tools and can be processed with standard XML tooling, but it is more difficult for humans to read than Turtle. Fortunately, tools often support multiple serializations, and libraries such as the open source Jena package make it easy to convert between them. Resources for the different serialization formats can be found at the end of this post.
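
To give a feel for the simplest of these formats, here is a small sketch that writes triples as N-Triples, where each line holds one complete statement with full URIs in angle brackets and literals in double quotes. The function names and the http://example.org/ namespace are assumptions for this post, and the sketch handles only plain string literals (no language tags or datatypes).

```python
from typing import Iterable, Tuple

# (subject URI, predicate URI, object, whether the object is a literal)
Statement = Tuple[str, str, str, bool]


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes, and line breaks inside a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(statements: Iterable[Statement]) -> str:
    """Serialize statements as N-Triples, one statement per line."""
    lines = []
    for subject, predicate, obj, obj_is_literal in statements:
        if obj_is_literal:
            obj_term = '"' + escape_literal(obj) + '"'
        else:
            obj_term = "<" + obj + ">"
        lines.append(f"<{subject}> <{predicate}> {obj_term} .")
    return "\n".join(lines)


EX = "http://example.org/"  # placeholder namespace

ntriples = to_ntriples([
    (EX + "AlexanderFleming", EX + "discovered", EX + "Penicillin", False),
    (EX + "AlexanderFleming", EX + "bornIn", "Scotland", True),
])
```

The output is verbose because every URI is spelled out in full, which is exactly the trade-off that the more compact serializations address.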

The Turtle RDF serialization is used for the examples in this blog post series because it is compact and relatively readable. The basic form of a triple in Turtle (hint: the examples above use a simplified version of it that omits namespace prefixes) is the subject, predicate, and object separated by spaces and ended with a period. Turtle includes a few shortcuts for writing multiple statements about the same subject or using the same predicate. When the next statement is about the same subject, a semicolon is used in place of the period after the first statement, and the subject is not repeated.

AlexanderFleming discovered Penicillin;
                 bornIn "Scotland".


When the next statement shares both the subject and the predicate, a comma is used to separate the objects. Note also that literals are enclosed in double quotes.

AlexanderFleming profession "Biologist", "Pharmacologist".
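
The semicolon and comma shortcuts amount to grouping triples by subject and then by predicate. The sketch below shows that grouping in Python; the to_turtle function is a made-up helper, not a real Turtle writer, and it assumes every term is already a valid Turtle token. The ex: prefix is also an assumption, added because names in real Turtle need a prefix or a full URI.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def to_turtle(triples: Iterable[Triple]) -> str:
    """Group triples by subject and predicate and apply the ; and , shortcuts."""
    grouped: Dict[str, Dict[str, List[str]]] = {}
    for subject, predicate, obj in triples:
        grouped.setdefault(subject, {}).setdefault(predicate, []).append(obj)

    blocks = []
    for subject, by_predicate in grouped.items():
        # Objects sharing a subject and predicate are joined with commas.
        parts = [f"{p} {', '.join(objs)}" for p, objs in by_predicate.items()]
        # Predicates sharing a subject are joined with semicolons and aligned.
        indent = " " * (len(subject) + 1)
        blocks.append(f"{subject} " + (" ;\n" + indent).join(parts) + " .")
    return "\n".join(blocks)


turtle = to_turtle([
    ("ex:AlexanderFleming", "ex:discovered", "ex:Penicillin"),
    ("ex:AlexanderFleming", "ex:bornIn", '"Scotland"'),
    ("ex:AlexanderFleming", "ex:profession", '"Biologist"'),
    ("ex:AlexanderFleming", "ex:profession", '"Pharmacologist"'),
])
```

Four separate triples collapse into a single block with one subject, three predicates, and two objects for the profession predicate.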

As mentioned above, the subject of a triple must be a resource, and all named resources must have a URI. However, it is often desirable to make a statement about a resource where it is impractical to give it a name and a URI. To address these situations, RDF includes the concept of blank nodes. Blank nodes can have local names whose scope is limited to the document in which they are used. Blank nodes are also commonly used when making statements about statements (i.e., reification). In Turtle, an anonymous blank node is written as a pair of square brackets enclosing the statements made about it. The example below demonstrates this construct: the side effect is not named, but statements about its properties are made.

Penicillin hasSideEffect [ duration "2 days";
                           acute "No"].
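
Behind the scenes, the bracket notation stands for several ordinary triples that share a generated, document-local node. The sketch below shows one way to picture that expansion. The BlankNodeFactory class and describe_anonymous function are hypothetical helpers for this post; the _:b0 style of label follows the blank node label syntax used by Turtle and N-Triples, though the particular numbering is arbitrary.

```python
import itertools
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


class BlankNodeFactory:
    """Hands out labels that are only meaningful within one document."""

    def __init__(self) -> None:
        self._counter = itertools.count()

    def new(self) -> str:
        return f"_:b{next(self._counter)}"


def describe_anonymous(
    subject: str,
    predicate: str,
    properties: Iterable[Tuple[str, str]],
    factory: BlankNodeFactory,
) -> List[Triple]:
    """Expand 'subject predicate [ p1 o1; p2 o2 ]' into plain triples."""
    node = factory.new()
    expanded: List[Triple] = [(subject, predicate, node)]
    expanded.extend((node, p, o) for p, o in properties)
    return expanded


factory = BlankNodeFactory()
side_effect_triples = describe_anonymous(
    "Penicillin",
    "hasSideEffect",
    [("duration", '"2 days"'), ("acute", '"No"')],
    factory,
)
```

The side effect never receives a URI, yet it can still be the object of one triple and the subject of two others.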


In the next part in the series, the RDF Schema standard will be discussed. It will be shown how RDFS can be used to define hierarchies of classes and properties.

References and Resources
