A Brief Introduction to Wikidata

Apr 10, 2018 · 5 minute read


Have you ever heard of Wikidata? If not, you might think of Wikipedia first, and that is not wrong: Wikidata is also a project of the Wikimedia Foundation. In particular:

“Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia (…).”

Loosely, you could describe Wikidata as Wikipedia’s database, with over 46 million data items (April 2018). And in line with Wikimedia’s mission, everyone can add and edit data and use it for free.

Available data

As with Wikipedia, all kinds of data are stored in Wikidata. So when you are looking for a specific dataset or want to answer a curious question, Wikidata is a good place to start looking. Example questions could be:

  1. What is the capital city of every member of the European Union and how many inhabitants live there?
  2. What do the Nobel Prize winners in Physics look like?
  3. Which countries use 112 as an emergency number?

(To see the answers, scroll down.)

Advantages and Disadvantages of Wikidata

There are some aspects you should keep in mind when using Wikidata. Whether they are an advantage or disadvantage, however, depends on you:

Wikidata…

  • is a free and open knowledge base that can be read and edited by both humans and machines
  • contains various data types (e.g. text, images, quantities, coordinates, geographic shapes, dates)
  • uses SPARQL

The last aspect in particular lets you ask very interesting questions like the ones above. If you have never used SPARQL before, however, it might be a struggle in the beginning. But don’t worry: the next section gives you a brief introduction.

Idea and Concept of SPARQL

SPARQL is a query language for RDF databases. In contrast to relational databases, which are queried with SQL, items are not stored in tables. Instead, items are linked with each other like a graph or network:

Wikidata RDS Database

To describe these relations, we can use a triple:

A triple is a statement containing a subject, a predicate, and an object.

Examples:

  • Germany (subject) has the capital (predicate) Berlin (object).
  • Berlin (subject) has the population (predicate) 3.5 million (object).
  • The European Union (subject) has the member (predicate) Germany (object).
  • Germany (subject) is a member of (predicate) the European Union (object).

You can come up with many more statements to describe the graph above, and that is a huge benefit of the graph model behind SPARQL: you are not tied to the fixed table structure of a relational database, and new information can be added easily. (If you want to dive deeper into the concept of SPARQL, I recommend this YouTube video (11 min).)
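To make the idea of triples a bit more tangible, here is a minimal Python sketch. It is not how Wikidata stores anything; it just keeps a few of the statements above as plain tuples and looks them up with a tiny pattern matcher. The names `TRIPLES` and `match` are made up for this illustration.

```python
# Each triple is (subject, predicate, object). Plain strings are used here
# instead of real Wikidata identifiers to keep the sketch readable.
TRIPLES = [
    ("Germany", "capital", "Berlin"),
    ("European Union", "has member", "Germany"),
    ("Germany", "member of", "European Union"),
    ("France", "member of", "European Union"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given parts; None is a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "Which subjects are a member of the European Union?"
members = [s for (s, _, _) in match(TRIPLES, predicate="member of", obj="European Union")]
print(members)  # ['Germany', 'France']
```

Leaving one part of the triple open and asking which values fit is, roughly speaking, what a SPARQL variable does.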

How to query data from Wikidata?

To get data from Wikidata, you simply use triples (like the ones above) to write a SPARQL query. Let’s have a look at what such a query might look like. Note that we are using specific identifiers to define the right relationship and item:

SELECT ?country
WHERE 
{
  ?country   wdt:P463     wd:Q458.
  #country   #member_of   #European_Union
}

Here, we simply ask for the countries that are part of the European Union.

Do you recognize the subject-predicate-object statement? We just select those countries for which the condition holds: the country (?country) is a member of (wdt:P463) the European Union (wd:Q458).
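If you prefer to see the structure spelled out, the query above can also be assembled from its triple pattern. This is only a sketch: `build_select` is a hypothetical helper written for this post, and it does no escaping or validation.

```python
def build_select(variables, patterns):
    """Assemble a SELECT query from variable names and (s, p, o) triple patterns."""
    head = "SELECT " + " ".join("?" + v for v in variables)
    # One line per triple pattern, each terminated by a dot as in SPARQL.
    body = "\n".join(f"  {s} {p} {o}." for (s, p, o) in patterns)
    return f"{head}\nWHERE\n{{\n{body}\n}}"


# subject: ?country, predicate: member of (wdt:P463), object: European Union (wd:Q458)
query = build_select(["country"], [("?country", "wdt:P463", "wd:Q458")])
print(query)
```

The printed text is the same query as above, just without the comment line.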

Using the Wikidata Query Service as an endpoint gives us the following result:

Wikidata Output EU Members Screencast

So far, we only get the identifier codes of the member states back. To see the country names, we use the label service and add it to our query:

SELECT ?country ?countryLabel
WHERE 
{
  ?country   wdt:P463          wd:Q458.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Wikidata Output EU Members

How simple is this? If you would like to try it yourself, just follow this link.
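You are not limited to the web interface. The same query can be sent from a script, and the sketch below shows one way to do that with Python’s standard library. It assumes the public endpoint at https://query.wikidata.org/sparql and its `query` and `format` parameters. The User-Agent string is a placeholder you should replace with your own. As far as I know, the [AUTO_LANGUAGE] placeholder is filled in by the web interface, so the script asks for "en" directly.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?country ?countryLabel
WHERE
{
  ?country wdt:P463 wd:Q458.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_url(query):
    """Encode the query as GET parameters and ask for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def run_query(query):
    """Send the query over the network and return the decoded JSON response."""
    request = urllib.request.Request(
        build_url(query),
        # Placeholder identification; replace with your own tool name and contact.
        headers={"User-Agent": "wikidata-intro-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


url = build_url(QUERY)
print(url)
# To actually fetch the data (needs network access):
# data = run_query(QUERY)
```

Only `run_query` talks to the network; `build_url` just prepares the request, so you can inspect the URL before sending anything.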

How to get the correct identifiers?

For all queries, it is essential to identify the correct items and relations. For this purpose, Wikidata uses specific identifiers.

In the example above, I already looked them up: The relation “Being a member of” has the identifier wdt:P463 and the item “European Union” is identified by wd:Q458.

But how would you get them?

What I recommend is to inspect the Wikidata page of an item you already know belongs in the result. Knowing that France is a member of the European Union, I would inspect its Wikidata item:

1. Open France in Wikipedia to get to its Wikidata item:

2. Inspect the Wikidata item:

Here, you simply hover over the relationship “member of” and item “European Union” to get their identifier codes.
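The short codes you find this way are abbreviations of full web addresses. The sketch below assumes the standard Wikidata prefixes (`wd:` for items, `wdt:` for direct properties) and shows how a prefixed identifier expands and how the code can be read off an item URL. The helper names `expand` and `item_id` are made up for this illustration.

```python
# Standard prefixes used by the Wikidata Query Service.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed):
    """Turn a prefixed name such as 'wd:Q458' into its full IRI."""
    prefix, local = prefixed.split(":", 1)
    return PREFIXES[prefix] + local


def item_id(url):
    """Take the trailing identifier from an item page or entity URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]


print(expand("wd:Q458"))    # http://www.wikidata.org/entity/Q458
print(expand("wdt:P463"))   # http://www.wikidata.org/prop/direct/P463
print(item_id("https://www.wikidata.org/wiki/Q458"))  # Q458
```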

Solutions (and more examples)

Do you remember the questions in the introduction? These are the queries you could use to answer them:

What is the capital city of every member of the European Union and how many inhabitants live there?

SELECT ?country ?countryLabel ?capitalLabel ?population 
WHERE 
{
  ?country wdt:P463 wd:Q458.
  ?country wdt:P36 ?capital.
  ?capital wdt:P1082 ?population.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
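If you fetch such a result as JSON instead of viewing it in the browser, it arrives in the standard SPARQL JSON results layout: a list of "bindings", each mapping a variable name to a typed value. The sketch below flattens that layout into simple rows. The `SAMPLE` response is hand-written for illustration (one row, with a rounded population figure), not real output from the query service.

```python
# Hand-written sample in the SPARQL JSON results layout; values are illustrative.
SAMPLE = {
    "head": {"vars": ["country", "countryLabel", "capitalLabel", "population"]},
    "results": {
        "bindings": [
            {
                "country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q183"},
                "countryLabel": {"type": "literal", "value": "Germany"},
                "capitalLabel": {"type": "literal", "value": "Berlin"},
                "population": {"type": "literal", "value": "3500000"},
            }
        ]
    },
}


def to_rows(response):
    """Flatten the bindings into one {variable: value} dict per result row."""
    variables = response["head"]["vars"]
    return [
        # A variable can be unbound in a row, so only keep the ones present.
        {name: binding[name]["value"] for name in variables if name in binding}
        for binding in response["results"]["bindings"]
    ]


rows = to_rows(SAMPLE)
for row in rows:
    # All values arrive as strings, so numbers need an explicit conversion.
    print(row["countryLabel"], row["capitalLabel"], int(row["population"]))
```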

What do the Nobel Prize winners in Physics look like?

#defaultView:ImageGrid
SELECT ?person ?personLabel ?image
WHERE 
{
  ?person wdt:P18 ?image;
          wdt:P166 wd:Q38104.
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Which countries use 112 as an emergency number?

#defaultView:Map
SELECT ?country ?countryLabel ?location
WHERE {
  ?country wdt:P2852 wd:Q1061257;
           wdt:P625 ?location.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

Interested in more?

I’m working on an online course about Wikidata. So if you are interested in more, leave your email address and receive a 25% coupon once the course starts.


This article has also been published at TowardsDataScience (Medium.com).