Authors: Andra Waagmeester, Katherine Thornton and @addshore
Wikidata is the linked database maintained by the Wikimedia Foundation, which also maintains Wikipedia. As an online encyclopedia, Wikipedia is one of the most visited websites worldwide. Started in 2001 as an English-language website, it currently has versions in more than 290 languages and is still growing. All these (connected) Wikipedias run on a software platform called MediaWiki. The MediaWiki stack has been released as open-source software to the general public and has since been used in many settings outside the Wikimedia ecosystem.
Although the different language Wikipedias are connected with site links, the linking is done at the page level. The details within those pages need to be maintained by the individual language communities. This means that when things change, the changes have to be made in each of the many languages. This creates a lot of redundancy, especially since much of that change is usually a simple change in non-language-specific details. In the worst case, a single update requires at least 290 Wikipedians to add the exact same thing. This parallel editing across language communities leads to incompleteness or differences in the facts reported.
In the example below, more disease information is reported in the English Wikipedia than in the Greek and Dutch Wikipedias. For the population size of Aruba, the numbers differ between languages.
This is where Wikidata enters the equation. Wikidata is the linked database of the Wikimedia Foundation, intended to feed all the projects of the Wikimedia Foundation. Besides Wikipedia, these include projects such as Commons, a repository of free images, videos, sounds, and other multimedia files. Linked databases are resources that use web addresses to allow easy linking between different, geographically dispersed, linked databases. This is nicely illustrated by the linked data cloud, which shows the connectivity between resources from all known domains.
Since its inception in 2012, Wikidata has become a leading infrastructure for hosting linked data, making it a key component of the Semantic Web. It has become a hub that links not only knowledge captured in Wikipedia but also knowledge outside that ecosystem. The Gene Wiki project is an example: researchers in the life sciences collaborate to combine biomedical knowledge and open it up to the larger public.
But, and there is always a but, not everything goes into Wikidata. The first criterion is that it contains only public data. It is worth noting that open data is not necessarily public data. Wikidata comes with what is called a CC0 waiver, which waives any rights to the content added.
Furthermore, a concept described in Wikidata needs to be notable. Something is notable if it meets at least one of the following three criteria: (1) it is described in any of the existing Wikipedias, (2) it is described by referenceable and trustworthy resources, or (3) it fulfils a structural need. Details on the notability constraints can be found here.
Personally, I would add a fourth criterion: the total number of concepts in a specific domain should not exceed 20% of Wikidata. It is a general knowledge base, and populating it with, for example, all known galaxies would instantly make it more a knowledge base on our universe than the general-purpose knowledge base it currently is. The most conservative estimate of existing galaxies is about 200 million, while Wikidata currently contains 50 million items. Yet there are compelling reasons to add knowledge about galaxies or similarly sized domains to a linked database. Also, if data is not available under a public license, we still want to be able to link those closed or hidden facts with a knowledge base like Wikidata.
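To make the arithmetic behind this concrete, the short Python sketch below works through the two figures quoted above. It assumes the 20% ceiling is measured as a share of all items after the import, which is only one possible reading of the criterion; the function and constant names are illustrative.

```python
# Figures quoted in the text above.
WIKIDATA_ITEMS = 50_000_000
GALAXY_ESTIMATE = 200_000_000
DOMAIN_CAP_PERCENT = 20  # suggested ceiling for a single domain (assumed reading)


def domain_share(domain_items: int, other_items: int) -> float:
    """Fraction of the knowledge base taken up by one domain after the import."""
    return domain_items / (domain_items + other_items)


def max_domain_items(other_items: int, cap_percent: int) -> int:
    """Largest domain size d such that d / (d + other_items) <= cap_percent / 100."""
    return other_items * cap_percent // (100 - cap_percent)


galaxy_share = domain_share(GALAXY_ESTIMATE, WIKIDATA_ITEMS)
galaxy_budget = max_domain_items(WIKIDATA_ITEMS, DOMAIN_CAP_PERCENT)

print(f"All galaxies would make up {galaxy_share:.0%} of the items")
print(f"A {DOMAIN_CAP_PERCENT}% cap leaves room for about {galaxy_budget:,} items")
```

Under these assumptions the galaxies alone would account for four out of every five items, while the cap would allow roughly 12.5 million items for any one domain.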
Here the OpenStack of Fuga Cloud can be instrumental. Wikidata runs on a stack whose core is called Wikibase. The stack also contains an API and the Wikidata Query Service (WDQS), and different systems exist to batch upload data to Wikidata. Most of the stack on which Wikidata runs can now also be run through Docker. This means that when your hunger for knowledge exceeds what Wikidata provides, it is now possible to run one yourself. Through federated queries in the SPARQL query language, it is possible to seamlessly integrate a custom Wikibase Docker install with Wikidata or other Wikibase installs. With the release of this Docker file for Wikibase, many are exploring how to spin up a personal or organisational Wikibase.
This article was written on 2018-09-17. We recommend reading the Getting Started guide for the most up-to-date flow.
We are currently running several of these Docker-based Wikibases on Fuga Cloud. It is as simple as:
- Create an account, if you don’t have one already
- Add billing details
- Click the button to set up Horizon
- Navigate to “Project”
- >> Compute
- >> Instances
- >> Launch Instance
- Name: wikibase1, Any availability zone, count 1
- Source: latest Ubuntu (currently 18.04)
- Flavour: c1.large (medium should also work)
- Key pair, create one and download the PEM :)
- >> Launch Instance
Add a public IP:
- From the instances page, click on the actions menu on the right hand side & click Associate floating IP
- Click the + button to add a new IP
- Click Allocate IP && Associate
Add firewall rules:
- Navigate to Project >> Compute >> Access & Security >> Security Groups
- Click on “Manage Rules” for the default group
- Allow ingress on port 22 (for ssh) for your IP (or all IPs, depending on how secure you want to be)
- Allow ingress on the ports for Wikibase & query service UI access (the defaults are 8181 and 8282) from all IPs (0.0.0.0/0)
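For those who prefer the command line over Horizon, the same three rules can be sketched as `openstack security group rule create` commands. The Python snippet below only builds and prints the command strings; it assumes the OpenStack command-line client is installed and authenticated against your project, and the address `203.0.113.7/32` is a documentation placeholder standing in for your own IP.

```python
import ipaddress
import shlex


def ingress_rule_command(port: int, cidr: str, group: str = "default") -> str:
    """Build an `openstack security group rule create` command for one TCP port."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    # ip_network raises ValueError when the CIDR is malformed.
    network = ipaddress.ip_network(cidr, strict=False)
    args = [
        "openstack", "security", "group", "rule", "create",
        "--ingress", "--protocol", "tcp",
        "--dst-port", str(port),
        "--remote-ip", str(network),
        group,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


MY_IP = "203.0.113.7/32"  # placeholder: replace with your own address

commands = [
    ingress_rule_command(22, MY_IP),          # ssh, restricted to one address
    ingress_rule_command(8181, "0.0.0.0/0"),  # Wikibase
    ingress_rule_command(8282, "0.0.0.0/0"),  # query service UI
]

for command in commands:
    print(command)
```

Printing the commands instead of running them keeps the sketch safe to try; paste the output into a shell once the values look right.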
Log in to the instance:
ssh <user>@<public-ip> # where <public-ip> is the public IP you assigned; on Ubuntu cloud images the default user is usually ubuntu
sudo apt-get update
- Install Docker, follow the instructions on https://docs.docker.com/install/linux/docker-ce/ubuntu/
- Install Docker Compose, follow the instructions on https://docs.docker.com/compose/install
Get your Docker Compose file
- In the future there will be a lovely UI from which you can download a docker-compose.yml
- Currently you have to start from the example and replace various default values: https://github.com/wmde/wikibase-docker/blob/master/docker-compose.yml
- All values that should be replaced are marked with “CONFIG” comments in the YAML
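Once the example file has been downloaded, a quick way to find everything that needs replacing is to list the lines that carry the marker. The Python sketch below does just that; it assumes the file sits in the current directory as `docker-compose.yml` and that the marker appears literally as `CONFIG`, as described above. The function name is illustrative.

```python
from pathlib import Path


def find_config_lines(compose_text: str, marker: str = "CONFIG"):
    """Return (line_number, stripped_line) pairs for every line mentioning the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(compose_text.splitlines(), start=1)
        if marker in line
    ]


compose_file = Path("docker-compose.yml")  # assumed location of the download
if compose_file.exists():
    for number, line in find_config_lines(compose_file.read_text()):
        print(f"{number}: {line}")
else:
    print("docker-compose.yml not found; download it first")
```

Each reported line number points at a marker comment, so the value to change is on that line or directly next to it.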
wget https://raw.githubusercontent.com/wmde/wikibase-docker/master/docker-compose.yml # then make your modifications, or download your already modified file instead
sudo docker-compose up -d # wait for a few seconds
sudo docker-compose ps # check the status
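Beyond `docker-compose ps`, it can help to confirm that the two web interfaces actually answer over HTTP. The following Python sketch polls the default ports named earlier (8181 and 8282); the `localhost` URLs assume it is run on the instance itself, and the function names, retry count and delay are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        # The server answered, possibly with a 4xx; it is at least running.
        return error.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return False


def wait_for_services(urls, probe=is_up, attempts=10, delay=3.0, sleep=time.sleep):
    """Poll until every URL responds or attempts run out; return the URLs still down."""
    pending = list(urls)
    for attempt in range(attempts):
        pending = [url for url in pending if not probe(url)]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return pending


# Default ports from the firewall step; adjust if you changed them.
SERVICES = ["http://localhost:8181", "http://localhost:8282"]
```

Calling `wait_for_services(SERVICES)` returns an empty list once both services respond, or the list of URLs that never came up.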
After going through these steps, there will be an empty Wikibase ready for use.
When this screen is visible, the Wikibase is ready to receive input. Because it runs on the same infrastructure as Wikidata, linked data can be added using the same data model, allowing for better integration between Wikidata and the freshly installed Wikibase instance. We are now able to populate this Wikibase instance running on the Fuga Cloud server using the same Python framework we use to add knowledge to Wikidata.
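To give an idea of what such input looks like at the API level, the sketch below builds the form fields for a single `wbeditentity` call that creates a new item with an English label and description. This is not the Python framework mentioned above, only the raw request it would ultimately have to send. The API path `/w/api.php`, the label text and the token value are assumptions; a real run would first log in and fetch a CSRF token via `action=query&meta=tokens`.

```python
import json
from urllib.parse import urlencode


def new_item_request(label: str, description: str, csrf_token: str,
                     language: str = "en") -> dict:
    """Form fields for a wbeditentity call that creates one new item."""
    entity = {
        "labels": {language: {"language": language, "value": label}},
        "descriptions": {language: {"language": language, "value": description}},
    }
    return {
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(entity),  # the entity itself travels as a JSON string
        "token": csrf_token,
        "format": "json",
    }


API_URL = "http://localhost:8181/w/api.php"  # assumed path on the default port

fields = new_item_request(
    label="Example protein",                # hypothetical content
    description="item created as a test",
    csrf_token="TOKEN-FROM-LOGIN",          # placeholder, not a working token
)
body = urlencode(fields)  # what would be sent as the POST body to API_URL
print(body)
```

Because the data model is shared, the same payload shape applies to Wikidata and to a local Wikibase; only the API URL and the credentials differ.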
After adding data to the local Wikibase, we were able to query across our own Wikibase, Wikidata and an external scientific database (UniProt). This allows us to ask questions of our data in novel ways. Members of the Wikidata community maintain mappings between Wikidata and UniProt. The portion of the query below that is highlighted by the blue rectangle shows how Wikidata is used to store UniProt mappings. Thus we can ask questions about our data in combination with cross-domain data in Wikidata and the external UniProt data. In the screenshot below, a single query submitted to our own Wikibase returns this integrated knowledge from the two remote Semantic Web servers.
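The general shape of such a federated query can be sketched as follows. This is not the query from the screenshot, only an illustration under stated assumptions: the local Wikibase has a property whose values are Wikidata item IRIs (the predicate IRI passed in is hypothetical), Wikidata property P352 holds the UniProt protein ID, and `up:mnemonic` is used as an example of a detail fetched from UniProt.

```python
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
UNIPROT_SPARQL = "https://sparql.uniprot.org/sparql"
UNIPROT_ID_PROPERTY = "http://www.wikidata.org/prop/direct/P352"  # UniProt protein ID


def federated_query(local_link_predicate: str, limit: int = 10) -> str:
    """Build one SPARQL query spanning the local Wikibase, Wikidata and UniProt."""
    return f"""
PREFIX up: <http://purl.uniprot.org/core/>

SELECT ?localItem ?wikidataItem ?uniprotId ?mnemonic WHERE {{
  # 1. Local Wikibase: items that point at a Wikidata item (assumed property).
  ?localItem <{local_link_predicate}> ?wikidataItem .

  # 2. Wikidata: the UniProt mapping maintained by the community.
  SERVICE <{WIKIDATA_SPARQL}> {{
    ?wikidataItem <{UNIPROT_ID_PROPERTY}> ?uniprotId .
  }}
  BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotId)) AS ?protein)

  # 3. UniProt: one detail about the protein itself.
  SERVICE <{UNIPROT_SPARQL}> {{
    ?protein up:mnemonic ?mnemonic .
  }}
}}
LIMIT {limit}
""".strip()


# Hypothetical local predicate; use the IRI of your own linking property.
query = federated_query("http://wikibase.svc/prop/direct/P1")
print(query)
```

The resulting text would be pasted into the query service UI of the local Wikibase. Each `SERVICE` block is evaluated by the remote endpoint it names, which is what lets one query combine all three sources.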
Running a Wikibase on an OpenStack cloud enables the integration of this local data with external endpoints using a single API. With this interface, users can skip many of the configuration steps for the Wikibase instance they create. Because the API and the Wikidata Query Service are included in the image, these instances of Wikibase are preconfigured to interoperate with the current Wikidata infrastructure.