DBpedia SPARQL Benchmark
http://aksw.org/Projects/DBPSB
The project has been deprecated and work on a successor is ongoing: https://github.com/AKSW/mosquito.

Triple stores are the backbone of increasingly many Data Web applications. It is evident that the performance of these stores is mission-critical for individual projects as well as for data integration on the Web in general. Assessing the performance of current triple stores is therefore important in order to observe the weaknesses and strengths of current implementations. DBPSB is a general SPARQL benchmark procedure, which we apply to the DBpedia knowledge base. The benchmark is based on query-log mining, clustering, and SPARQL feature analysis. In contrast to other benchmarks, we perform measurements on actually posed queries against existing RDF data. Previous approaches often compared relational and triple stores and thus settled on measuring performance against a relational database that had been converted to RDF, using SQL-like queries. We argue that a pure RDF benchmark is more useful for comparing existing triple stores, and we provide results for Virtuoso, Sesame, Jena-TDB, and BigOWLIM.

Here we provide an overview of the steps required to create the benchmark. The methodology can in principle be applied to any RDF knowledge base, which allows the benchmark to be updated as the knowledge base and the queries posed to it evolve.

Dataset Generation

Base Data: DBpedia 3.5.1, with all the data sets mentioned, is available here. In order to generate a dataset of a specific size, do the following steps:
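The steps themselves are specific to the scripts provided for download. As a rough illustration of the Random Triple method mentioned below, here is a minimal sketch, not the project's actual generator: it assumes the dump is a single N-Triples file (one triple per line), and the file names and the 10% target ratio are placeholders.

    import java.io.*;
    import java.util.Random;

    // Minimal sketch: derive a smaller benchmark dataset by sampling a fixed
    // fraction of triples from an N-Triples dump (Random Triple method).
    // File names and the 10% ratio are illustrative placeholders.
    public class RandomTripleSampler {
        public static void main(String[] args) throws IOException {
            double ratio = 0.10;          // target dataset size, e.g. 10%
            Random rnd = new Random(42);  // fixed seed for reproducibility
            try (BufferedReader in = new BufferedReader(new FileReader("dbpedia_3.5.1.nt"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("dbpedia_10pct.nt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // In N-Triples each line holds exactly one triple,
                    // so sampling lines is sampling triples.
                    if (rnd.nextDouble() < ratio) {
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
    }

The Random Instance method, by contrast, selects whole instances rather than individual triples, which presumably explains the speed difference noted below.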
Generating data using the Random Triple method is much faster than generation by the Random Instance method. Datasets are available for download here.

Query Generation

Query Log: here. In order to sort the query log by frequency of queries, do the following steps:
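As a minimal sketch of this sorting step (assuming the log has already been reduced to one query string per line; the file names are placeholders, and this is not the project's actual tooling):

    import java.io.*;
    import java.util.*;

    // Minimal sketch: order a query log by how often each query occurs.
    // Assumes one query string per line; file names are placeholders.
    public class QueryLogSorter {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> freq = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("queries.log"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    freq.merge(line, 1, Integer::sum);  // count each distinct query
                }
            }
            List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
            entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
            try (BufferedWriter out = new BufferedWriter(new FileWriter("queries_by_frequency.log"))) {
                for (Map.Entry<String, Integer> e : entries) {
                    out.write(e.getValue() + "\t" + e.getKey());  // frequency, then query
                    out.newLine();
                }
            }
        }
    }

Whether this fits in memory depends on the number of distinct queries in the log.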
This step takes approximately 2.5 hours.

Clustered Query Log:
List of Benchmark Queries: To avoid caching of query results, we introduce a small difference into each run of a query in order to force the triple store not to fetch the result from its cache. The required steps to obtain our set of queries are as follows:
The time for selecting the queries and their auxiliary queries is approximately 3 days.

Sample Query

The following is a sample query with a variable part that will be used as a placeholder during the hot-run phase. The placeholder is indicated by %%v%%.

    SELECT * WHERE {
      { ?v2 a dbp-owl:Settlement;
            rdfs:label %%v%%.
        ?v6 a dbp-owl:Airport. }
      { ?v6 dbp-owl:city ?v2. } UNION { ?v6 dbp-owl:location ?v2. }
      { ?v6 dbp-prop:iata ?v5. } UNION { ?v6 dbp-owl:iataLocationIdentifier ?v5. }
      OPTIONAL { ?v6 foaf:homepage ?v7. }
      OPTIONAL { ?v6 dbp-prop:nativename ?v8. }
    }

We use another query, called the auxiliary query, to obtain a list of possible values for that placeholder. During the hot-run phase, the application selects a random value out of the list of possible values for the placeholder. The auxiliary query used to fill that list is as follows:

    SELECT DISTINCT ?v WHERE {
      { ?v2 a dbp-owl:Settlement;
            rdfs:label ?v.
        ?v6 a dbp-owl:Airport. }
      { ?v6 dbp-owl:city ?v2. } UNION { ?v6 dbp-owl:location ?v2. }
      { ?v6 dbp-prop:iata ?v5. } UNION { ?v6 dbp-owl:iataLocationIdentifier ?v5. }
      OPTIONAL { ?v6 foaf:homepage ?v7. }
      OPTIONAL { ?v6 dbp-prop:nativename ?v8. }
    } LIMIT 1000

Benchmark Execution

Loading Procedures: In order to load the data into the four different triple stores, do the following steps:
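The loading procedure differs per store. As one hedged example, the following sketch loads an N-Triples file into Jena-TDB through the current Apache Jena API; the TDB directory and dump file name are placeholders, and the benchmark itself may have used the store's bulk loader instead. Virtuoso, Sesame, and BigOWLIM each have their own loading mechanisms.

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    // Minimal sketch: load an N-Triples dump into an on-disk Jena-TDB store.
    // The TDB directory and dump file name are illustrative placeholders.
    public class TdbLoader {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            dataset.begin(ReadWrite.WRITE);
            try {
                // Parse the dump and add its triples to the default graph.
                RDFDataMgr.read(dataset.getDefaultModel(), "dbpedia_10pct.nt");
                dataset.commit();
            } finally {
                dataset.end();
            }
        }
    }

For datasets at the 100% scale, a store's command-line bulk loader (e.g., Jena's tdbloader) is generally preferable to API-level loading.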
Loading the dataset of size 100% takes approximately 10 hours for Virtuoso, 8 hours for Jena-TDB, 14 hours for Sesame, and 8 hours for BigOWLIM.

Benchmark Procedures

There are four classes, 'VirtuosoQueryExecutor', 'JenaTDBQueryExecutor', 'SesameQueryExecutor', and 'BigOWLIMQueryExecutor', one for each type of triple store. Each of them contains a function called 'executeQuery' that takes a SPARQL query as a parameter and returns the execution time of that query against the triple store of interest in microseconds. This function is called within a loop that runs for 20 minutes for warm-up and then for 60 minutes for the actual measurement.
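Putting the placeholder mechanism and the timing loop together, here is a minimal sketch. The method name executeQuery and the four executor class names come from the description above; the QueryExecutor interface and the helper methods are illustrative assumptions, not the project's actual code.

    import java.util.List;
    import java.util.Random;

    // Each store-specific class ('VirtuosoQueryExecutor', etc.) exposes
    // executeQuery(String), returning the execution time in microseconds.
    // This interface abstracts over them for the sketch.
    interface QueryExecutor {
        long executeQuery(String sparqlQuery);
    }

    public class BenchmarkRunner {
        private static final Random RND = new Random();

        // Replace %%v%% with a random value from the auxiliary query's results,
        // so successive runs differ and the store cannot serve a cached result.
        static String instantiate(String template, List<String> values) {
            return template.replace("%%v%%", values.get(RND.nextInt(values.size())));
        }

        // Run the query template repeatedly for the given number of minutes;
        // returns the total measured query time in microseconds.
        static long runFor(QueryExecutor executor, String template,
                           List<String> values, long minutes) {
            long deadline = System.currentTimeMillis() + minutes * 60_000;
            long totalMicros = 0;
            while (System.currentTimeMillis() < deadline) {
                totalMicros += executor.executeQuery(instantiate(template, values));
            }
            return totalMicros;
        }

        static void benchmark(QueryExecutor executor, String template, List<String> values) {
            runFor(executor, template, values, 20);                  // warm-up, discarded
            long measured = runFor(executor, template, values, 60);  // actual measurement
            System.out.println("Measured query time (us): " + measured);
        }
    }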
Benchmark Metrics

The main metrics used in DBPSB for performance measurement are queries per second (QpS) and query mixes per hour (QMpH).
Source code: http://akswbenchmark.svn.sourceforge.net/
Maintainer: Mohamed Morsey
Project page: http://wiki.aksw.org/Projects/DBPSB