Making Sense of 50 Billion Triples: Getting Started

In my last post I outlined some of the reasons why capitalizing on the promise of graph analytics takes thought and planning. Now I’d like to focus on getting started at the appropriate place, the beginning: understanding the data itself.

So let’s discuss methods of gaining an initial understanding of our data so that we can feed that understanding back into our ingestion process. First, we need to be able to query the database for some basic information that will tell us not only what is there but also how the relationships are expressed. Ideally, we would also find out whether there is a hierarchy to the relationships we find. By hierarchy, I mean that graphs can use containers, if you will, to organize information about different types of entities (e.g., events, people, places or things), which lets us describe far more than a flat collection of basic triples would.

To make sense of a new and unfamiliar dataset, we can run a series of exploratory queries that reveal the data’s macro structure and show us where to dig deeper. In short, instead of waving my hands in the air and muttering “presto” under my breath while skipping over all the steps it took to get to those promised miraculous insights, I’ll go through some of the techniques I commonly use. Bear in mind that few, if any, of these techniques are proprietary. Many of them are taken word for word from existing research on graph analytic techniques; others are adapted from a combination of several techniques.

So, where to start? First, it makes sense to see how big the graph is. This is an easy, common-sense step, so I will not belabor the point and will simply give the queries:

How many triples are there?

SELECT (COUNT(*) AS ?count)
WHERE {
  ?s ?p ?o .
}

How many distinct predicates are there, and how many triples (and therefore object values) are associated with each?

SELECT ?p (COUNT(?o) AS ?count)
WHERE {
  ?s ?p ?o .
}
GROUP BY ?p
ORDER BY ?p

[Image: 50-billion-triples2]

A count of all records is something of a given, but why count the predicates and how many values are associated with them? Because it gives us a quick idea of how complex our graph is in terms of the types of relationships that exist.
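To make concrete what the per-predicate query computes, here is a minimal Python sketch over a tiny, made-up list of triples. The `TRIPLES` list, the `ex:` names and the `predicate_profile` helper are all illustrative assumptions, not part of the dataset discussed in this post. It mirrors the GROUP BY ?p count and also reports the number of distinct objects per predicate, which can differ from the number of triples.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples (subject, predicate, object); not from the real dataset.
TRIPLES = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:alice", "ex:livesIn", "ex:Seattle"),
    ("ex:Seattle", "rdf:type", "ex:Place"),
]


def predicate_profile(triples):
    """Return {predicate: (triple_count, distinct_object_count)}, sorted by predicate."""
    triple_counts = Counter(p for _, p, _ in triples)
    distinct_objects = defaultdict(set)
    for _, p, o in triples:
        distinct_objects[p].add(o)
    return {
        p: (triple_counts[p], len(distinct_objects[p]))
        for p in sorted(triple_counts)
    }


if __name__ == "__main__":
    print("total triples:", len(TRIPLES))
    for predicate, (n_triples, n_objects) in predicate_profile(TRIPLES).items():
        print(predicate, n_triples, n_objects)
```

A predicate with many triples but few distinct objects (like a type predicate) usually behaves as a classifier, while one where the two numbers are close tends to link individual entities.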

Next, we want to explore this further. If an ontology exists, we should be able to see evidence of that in the predicates we just saw. In particular, we look for classes and subclasses:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Count typing and subclass statements. rdfs:Class appears as an object of rdf:type, not as a predicate.
SELECT ?p ?o (COUNT(?o) AS ?count)
WHERE {
  ?s ?p ?o .
  FILTER (?p IN (rdf:type, rdfs:subClassOf))
}
GROUP BY ?p ?o
ORDER BY ?count
LIMIT 10000

[Image: 50-billion-triples3]
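As a rough illustration of what the class and subclass query surfaces, the following Python sketch counts (predicate, object) pairs for typing and subclass triples in a small invented graph, then walks the subclass links upward. The data, the `schema_counts` and `superclasses` helpers, and the assumption that each class has at most one direct superclass are all mine, chosen to keep the example short.

```python
from collections import Counter

RDF_TYPE = "rdf:type"
RDFS_SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical toy triples; the ex: names are invented for illustration.
TRIPLES = [
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:carol", RDF_TYPE, "ex:Student"),
    ("ex:Student", RDFS_SUBCLASS_OF, "ex:Person"),
    ("ex:Person", RDFS_SUBCLASS_OF, "ex:Agent"),
    ("ex:alice", "ex:knows", "ex:bob"),
]


def schema_counts(triples):
    """Count (predicate, object) pairs for typing and subclass triples only."""
    wanted = {RDF_TYPE, RDFS_SUBCLASS_OF}
    return Counter((p, o) for _, p, o in triples if p in wanted)


def superclasses(cls, triples):
    """Follow rdfs:subClassOf upward from cls; ancestors are returned nearest first.

    Assumes at most one direct superclass per class (a simplification).
    """
    parents = {s: o for s, p, o in triples if p == RDFS_SUBCLASS_OF}
    chain, seen = [], {cls}
    while cls in parents and parents[cls] not in seen:  # stop on cycles
        cls = parents[cls]
        seen.add(cls)
        chain.append(cls)
    return chain


if __name__ == "__main__":
    for (p, o), n in sorted(schema_counts(TRIPLES).items()):
        print(p, o, n)
    print(superclasses("ex:Student", TRIPLES))
```

On a real store the subclass walk would more naturally be a SPARQL property path; the loop here only shows the idea of a hierarchy sitting on top of the plain triples.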

We now see a bit more structure, because the results show what our ontology is. Next, it might be useful to see whether our data has any implicit structure based on these findings:

SELECT ?type ?p (COUNT(?o) AS ?count)
WHERE {
  ?s a ?type .
  ?s ?p ?o .
}
GROUP BY ?type ?p
ORDER BY ?type
LIMIT 10000

[Image: 50-billion-triples4]
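The join in the query above (every typed subject paired with every triple it appears in as subject) can be sketched in a few lines of Python. This is only an illustration over invented data; `TRIPLES` and `properties_by_type` are hypothetical names, and the sketch assumes the triples contain no duplicates, as in an RDF graph.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical toy triples; not taken from the dataset in this post.
TRIPLES = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:livesIn", "ex:Seattle"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:Seattle", RDF_TYPE, "ex:Place"),
    ("ex:Seattle", "ex:locatedIn", "ex:Washington"),
]


def properties_by_type(triples):
    """Mirror '?s a ?type . ?s ?p ?o .' grouped by (?type, ?p)."""
    types_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_of[s].add(o)
    counts = Counter()
    for s, p, _ in triples:
        # A subject with several types contributes one row per type.
        for t in types_of.get(s, ()):
            counts[(t, p)] += 1
    return counts


if __name__ == "__main__":
    for (entity_type, predicate), n in sorted(properties_by_type(TRIPLES).items()):
        print(entity_type, predicate, n)
```

Note that the type predicate itself shows up in the result for every type, just as it does in the SPARQL version, and that untyped subjects drop out of the summary entirely.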

So, now we can see that there is indeed a more complex structure. Perhaps it might be useful to visualize this as a graph instead?

[Image: 50-billion-triples5]
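One possible way to get such a picture, sketched here under my own assumptions rather than as the method used for the image above, is to collapse the data to type-level edges (subject type, predicate, object type) and emit Graphviz DOT text. The toy `TRIPLES`, `type_level_edges` and `to_dot` are hypothetical names introduced for this example.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical toy triples; invented for illustration.
TRIPLES = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:Seattle", RDF_TYPE, "ex:Place"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:alice", "ex:livesIn", "ex:Seattle"),
]


def type_level_edges(triples):
    """Count (subject_type, predicate, object_type) over non-typing triples."""
    types_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types_of[s].add(o)
    edges = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for s_type in types_of.get(s, ()):
            for o_type in types_of.get(o, ()):
                edges[(s_type, p, o_type)] += 1
    return edges


def to_dot(edges):
    """Render the type-level edges as Graphviz DOT text."""
    lines = ["digraph schema {"]
    for (src, pred, dst), n in sorted(edges.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{pred} ({n})"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(type_level_edges(TRIPLES)))
```

The printed text can be saved to a file and rendered with the Graphviz `dot` tool. Triples whose object is a literal or an untyped resource are skipped in this sketch, so the picture only shows links between typed entities.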

Now that we know our dataset a bit better, let’s summarize. We took a dataset we knew nothing about and discovered the following:

The size of the graph
Complexity (e.g., how many dimensions)
The ontology
There is a hierarchy beyond just the simple graph structure
What properties are associated with each entity type

In our next installment, we can use the information we have gathered to dig deeper into the graph. Now that we know something about the types of relationships that exist, we can use that information to determine the overall structure of the graph from the standpoint of each type of entity.

Please feel free to comment or reach out to me personally if you have any questions about this work or would like to dig a bit deeper into the thought process. If you have come up with your own workflow for accomplishing similar goals, I would love to hear about it!

The post Making Sense of 50 Billion Triples: Getting Started appeared first on Cray Blog.
