Difference between Dataset, Model, Graph, etc. in Jena

Note: I have taken this answer from this link. I hope to add some other relevant concepts to it.

  • A Dataset is a collection of models (one being the Default Model, any others being Named Models) that you expect will have new triples added to it over time. You can read from and write to Datasets.
  • A Model is a collection of statements – this is what you typically aim your SPARQL queries at. If you SPARQL a Dataset and don’t use a ‘FROM NAMED’ clause, you’re querying the Default Model.
  • A Graph is a collection of triples. Every Model can be turned into a Graph, to provide a somewhat closer representation of the RDF, OWL, and SPARQL standards.
  • A DatasetGraph is a container for Graphs that provides the infrastructure for Default and Named Graphs.
  • Some people prefer the DatasetGraph / Graph representation, which gives you a different suite of methods to call. It’s really a matter of preference: both the Model and Graph approaches will get the job done, though it seems to me that the Model approach is a bit higher-level and more user-friendly.
  • A typical workflow for data analysis would have you identify a DataSource, like DBpedia. You’d define a bunch of different Datasets by querying DBpedia with CONSTRUCT statements. Now, you have static snapshots that you can use for your analysis work. In many Datasets, you’ll just have one model, the default model. However, sometimes you want the added complexity of Named Models, in which case your Datasets will have a few Models in each.
  • If you’re doing analysis, you usually want to set up DataSources (as proxies) for each repository you want to build your Datasets from. You’ll be in charge of persisting your Datasets, and can decide when to refresh them with new data by re-querying your sources (if you even want to refresh your data). The persistence of Models, then, comes naturally as a result of the Dataset persistence.
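
To make the relationship between these concepts concrete, here is a minimal sketch in plain Python. It is not Jena's API: the class ToyDataset, its methods, and the ex: names are all hypothetical and exist only to illustrate the idea of one default graph plus any number of named graphs, each being a collection of triples.

```python
from collections import defaultdict


class ToyDataset:
    """Hypothetical stand-in for a dataset: one default graph plus named graphs.

    A triple is a (subject, predicate, object) tuple and a graph is a set of triples.
    """

    def __init__(self):
        self.default_graph = set()
        self.named_graphs = defaultdict(set)

    def add(self, triple, graph_name=None):
        # Without a graph name the triple goes into the default graph.
        if graph_name is None:
            self.default_graph.add(triple)
        else:
            self.named_graphs[graph_name].add(triple)

    def match(self, s=None, p=None, o=None, graph_name=None):
        # None acts as a wildcard, much like a variable in a triple pattern.
        if graph_name is None:
            source = self.default_graph
        else:
            source = self.named_graphs.get(graph_name, set())
        return sorted(
            t for t in source
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


ds = ToyDataset()
ds.add(("ex:Berlin", "ex:capitalOf", "ex:Germany"))
ds.add(("ex:Paris", "ex:capitalOf", "ex:France"), graph_name="ex:snapshot-1")

# Matching without a graph name only looks at the default graph.
default_hits = ds.match(p="ex:capitalOf")
named_hits = ds.match(p="ex:capitalOf", graph_name="ex:snapshot-1")
print(default_hits)
print(named_hits)
```

As in the description above, a lookup that does not name a graph only sees the default graph; the triple stored under ex:snapshot-1 is found only when that graph is asked for explicitly.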

Graphically, some of the above concepts can be depicted as in the diagram below:

Other relevant stuff can be found at the following links:

1. Craig Trim: Working with DataSets using Jena

Setting Up Semantic Software on Mac OS X

In this post, I describe the procedures required to set up the necessary semantic software on Mac OS X 10.10.3:

  • Setting up OpenLink Virtuoso: For our purpose we use Virtuoso as a triple store. Although a number of triple stores are available, from my limited literature survey I found Virtuoso to be fast, scalable, and easy to set up. The steps from getting a copy of Virtuoso to starting it up on a Mac are:
    1. Download a fresh copy of Virtuoso from link
    2. Unzip the copy at some suitable place
    3. Please read “README” file in the unzipped folder and ensure that you have all dependencies installed
    4. In this step we change the default stack size. In the same directory, open binsrc/virtuoso/viunix.c in a text editor, search for thread_initial, and change its default size from 60000 to 80000. Save the file and exit
    5. Now open the "INSTALL" file in the same unzipped folder. Read it carefully and follow the steps in order, i.e., i) ./configure, ii) make, iii) make check, and iv) sudo make install
    6. Add the location of Virtuoso to the PATH variable. There are various ways to do this; here is one. Open the .bash_profile file and add the following line to it: export PATH=$PATH:/usr/local/virtuoso-opensource/bin/ (note that there must be no space around the = sign). Save the file and exit.
    7. Once all of the above steps have completed successfully, check whether Virtuoso is installed correctly. Move to the location of the virtuoso.ini file, which is usually /usr/local/virtuoso-opensource/var/lib/virtuoso/db
    8. Start the server with the command: sudo virtuoso-t -f &. On my system, the server starts with sudo virtuoso-t -f
    9. Open http://localhost:8890/conductor in a web browser.

NOTE: These steps are also available at link, but that post omits step 4. This is the only reason I created a separate post.
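
If the Conductor page from step 9 does not load, it can help to first check whether anything is listening on the port at all. The following is a minimal sketch in Python; the function name is_listening is my own, and the default port 8890 is simply taken from the URL in step 9. It only confirms that some process accepts TCP connections there, not that the process is Virtuoso.

```python
import socket


def is_listening(host="localhost", port=8890, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if the connection is refused or times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Port 8890 is assumed from the Conductor URL used in step 9.
print("server reachable:", is_listening())
```

If this prints False right after step 8, the server most likely did not start, and the output of virtuoso-t in the terminal is the next place to look.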

  • Setting up the Jena framework: Jena is a Java-based framework. It is widely used within the semantic web community and is well suited to experimenting with semantic technologies. To set up Jena on your system, follow these steps:
    1. Download a fresh copy of apache-jena from the Apache download centre located at link
    2. Please follow the steps of integration of apache-jena with Eclipse at link
    3. To get a clear understanding of how ontologies/models are handled in Jena look at the documents present at link
  • Creating ontologies: Ontologies are one of the basic components of semantic technologies. They define the basic schema, structure, and relationships of a particular domain. There are various tools available for building ontologies. I am using Protégé, developed at Stanford University. Download a copy for your system and give it a try by following some of the tutorials available online.

Graph Databases and Relational Databases

A relational database (RD) stores data in the form of tables, whereas a graph database (GD) stores data in the form of graphs. Triple stores (TS) are included in graph databases; more specifically, TS are a type of NoSQL GD. However, TS differ from other NoSQL graph databases in several ways. Triple stores have been implemented to store RDF, which is a special kind of graph: a directed labelled graph. NoSQL graph databases can store different types of graphs: unlabelled graphs, undirected graphs, weighted graphs, hypergraphs, etc. [Diversity]. The major points of difference between GD and RD are [1]:

  • GDs are occurrence based, whereas RDs are schema based. In an RD we fix a proper schema/structure for the database beforehand, whereas in a GD no prior structure is fixed, which gives us the flexibility to add new data and relationships as they are encountered. So for conventional applications, where we know the schema of the stored data beforehand, we should use an RD; in the opposite scenario we should use a GD.
  • Navigation: In a GD you start with a root object and then traverse to related objects, whereas in an RD navigation happens through joins. Navigation via joins tends to get harder to write and to follow as the number of hops grows.
  • RDs do not handle recursion naturally, although support is provided through extensions (for example, recursive common table expressions in SQL). In contrast, GDs handle recursion very well. This is the beauty of graphs.
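
The navigation and recursion points can be illustrated with a small sketch. It answers the same question ("which nodes are reachable from a given start node?") in two ways: with a plain graph traversal, and with a recursive common table expression over an edge table in SQLite, which ships with Python. The table name edge, the sample edges, and both function names are assumptions made for this illustration and do not come from any particular database product.

```python
import sqlite3
from collections import deque

# Hypothetical sample data: a small chain a -> b -> c -> d plus an unrelated edge.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]


def reachable_graph(edges, start):
    """Graph-style navigation: start at a node and walk to its neighbours."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reachable_sql(edges, start):
    """Relational-style navigation: repeated joins, expressed as a recursive CTE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM edge WHERE src = ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
        )
        SELECT node FROM reach
        """,
        (start,),
    ).fetchall()
    conn.close()
    return {row[0] for row in rows}


print(sorted(reachable_graph(edges, "a")))
print(sorted(reachable_sql(edges, "a")))
```

Both functions return the same set of nodes. The difference is in how the traversal is expressed: the graph version follows edges directly, while the SQL version has to restate the walk as a join that is applied repeatedly.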

There are many differences, but in the end it boils down to your application. I think the first point above should answer the main question. Performance considerations will also help you decide whether to go for an RD or a GD. Some GDs include Neo4j, AllegroGraph, Stardog, and OpenLink Virtuoso.

Some important points regarding databases include:

  • Graphs can be implemented in a relational database using foreign keys, but it often needs link tables to model the complexities [2]. A difficulty with implementing triple stores over SQL is that although triples may thus be stored, implementing efficient querying of a graph-based RDF model (e.g., mapping from SPARQL) onto SQL queries is difficult [Jean]. Also, SPARQL, the standard query language used by triple stores, can require a lot of self joins, something RD’s are not optimised for [Michael].
  • Triple stores are essentially GDs, but TS don’t store data in the form of graphs. The different types of TS, based on their implementation, include [Diversity]:
  • Among GDs, RDF database systems are the only standardised ones at the moment. They are built upon the W3C’s Linked Data technology stack [Arto].
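
To illustrate the self-join point made above, here is a minimal sketch that stores triples in a single three-column SQL table and answers a two-pattern query. The table name triple, the ex: identifiers, and the sample data are hypothetical; the SPARQL in the comment is abbreviated (no PREFIX declaration) and is shown only for comparison.

```python
import sqlite3

# Hypothetical sample triples: (subject, predicate, object).
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:carol", "ex:name", "Carol"),
    ("ex:dave", "ex:name", "Dave"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

# SPARQL-style query, for comparison only:
#   SELECT ?name WHERE { ex:alice ex:knows ?friend . ?friend ex:name ?name }
# Each triple pattern becomes one more copy of the same table in the SQL join.
rows = conn.execute(
    """
    SELECT t2.o
    FROM triple AS t1
    JOIN triple AS t2 ON t2.s = t1.o
    WHERE t1.s = 'ex:alice'
      AND t1.p = 'ex:knows'
      AND t2.p = 'ex:name'
    ORDER BY t2.o
    """
).fetchall()
conn.close()

friend_names = [row[0] for row in rows]
print(friend_names)
```

A query with two triple patterns already joins the table with itself once; every additional pattern adds another self-join, which is the workload the bullet above describes as a poor fit for RDs.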