In our graph series so far, we have explored what graph databases are and when they are valuable to use, as well as the Cray Graph Engine (“CGE”), a robust graph solution. For this last installment, we dive into how hardware affects the performance of a graph database.
Cray’s main product line, the XC™ series, is mostly used for scientific computing. From the point of view of an applications programmer, there is an important difference between scientific computing and the kind of computations done on a graph database. Programmers call it spatial locality. In a nutshell, if a computation has a lot of spatial locality, when a computation has to fetch some value from memory, the next value it’s going to need is usually stored nearby in the same memory. A modern processor’s hardware caches are designed to exploit spatial locality: If the value needed by the computation right now is in the cache, chances are that the next value the computation is going to need is also in the cache — because it was nearby and got loaded into the cache at the same time as the first value.
Scientific computations typically have high spatial locality, because it’s dictated by the physical phenomena that they’re simulating. This gives the scientific application programmer a major advantage in achieving high performance on a parallel system, because high spatial locality means that each processor can spend a large fraction of its time working on values stored in local memory, and only a small fraction of time exchanging data with other processors.
Graph databases, on the other hand, have little or no spatial locality. Consider your Facebook friends, for example. Many of them may live near you, but you’ve probably have some back in your home town, others across the country and probably a few halfway across the world. There wouldn’t be any good way to gather every Facebook user’s friends close together, unless they could all fit into the same memory. If the Facebook friends graph were to be processed as if it were a database, there is a very good chance that any edge that the program hopped across would bring it to a vertex located in a different processor’s memory.
Lack of spatial locality can hurt performance. Here’s the “10x” rule of thumb I use:
- If the next value your program needs to use is already in one of the processor’s registers, you can use it in one instruction from now.
- If it’s in the cache, you can use it in about 10 instructions from now.
- If it’s in this processor’s local memory, you can use it in 100 cycles from now.
- But if it’s in another processor’s memory and your program has to issue a remote fetch to bring it into this processor, it takes 1,000 instruction cycles.
Thus, a severe lack of spatial locality, as you typically see in graph computations, can cause the program to run very slowly, because processors spend a lot of their time idling, waiting for results to arrive. What we’ve done at Cray to mitigate the lack of spatial locality in graphs is to take into account another characteristic of graph computations: a high degree of parallelism. If you need to search a huge graph, you can use a huge number of fine-grain program threads running in parallel, and have them searching in lots of different places at once. Even if each remote fetch takes 1,000 instruction cycles to get the data value where it’s needed, if each processor can keep close to 1,000 remote memory requests in flight at once, then the processors are going to stay pretty busy — because, on average, a needed value is going to be showing up every instruction cycle.
Our first graph database solution, the Urika®-GD system, uses a custom “multithreaded” processor to keep lots of remote memory requests in flight at any time. CGE combines hardware and software to achieve good performance on graph computations. On the hardware side, we use the Cray Aries™ interconnect. It’s a high-bandwidth, low-latency interprocessor communication network, and it has a characteristic that’s especially good for low-locality problems: It’s efficient at one-word remote memory fetches and stores. Most commodity networks assume high spatial locality and fetch a whole cache line at once. For low-locality computations on graphs, that means that around 80 percent of the network bandwidth will be wasted transporting nearly empty packets. On the software side, we emulate multithreading in software in order to keep lots of remote fetches in flight at once.
The result of this extra attention we pay to the low-spatial-locality nature of graph computations is that our performance is significantly better, especially on complex queries (which tend to exacerbate the lack of locality), than our competitors — and we scale much larger than they do.
Here’s an example. Freiburg University in Germany published the SPARQL Performance Benchmark (SP2B). We looked at one of the more complex queries in their benchmark, Query 2, and ran it on CGE and two fairly prominent competitors using an eight-processor system. SP2B uses a synthetic database, which you can generate to be whatever size you want to test.
- When we ran Query 2 with 1 million RDF triples (e., a graph with 1 million edges), it took CGE a little over a second. Competitor B was actually a little faster, at about one second flat. Competitor A was pushing 30 seconds.
- When we ran the query against 25 million triples, CGE took around 2.5 seconds. Competitor A dropped out; we couldn’t load that much data into it. Competitor B took just below 10 seconds.
- When we ran the query against 50 million triples, we took a little under four seconds. Neither Competitor A nor Competitor B could load that much data.
Two comments about this comparison:
- Am I worried about being beaten on 1 million triples? Not so much. If your databases aren’t going to be bigger than a million triples, you don’t need a Cray system. Use your laptop.
- These tests were run using eight processor nodes of a Cray cluster. CGE scales to a lot more than eight processor nodes. We regularly test on datasets with multiple billions of triples on larger configurations, approximately 20 to 30 nodes.
In conclusion, the Cray Graph Engine gives you a way to analyze a graph in ways that include pattern matching and filtering (standard SPARQL), sophisticated graph analysis (our INVOKE extension and our graph algorithm library), in a powerful, interactive system that scales to graphs with billions of edges.
I hope this blog series spurs you to explore more about how you could use graph databases to solve your complex data analysis problems. And if your problems are large scale, definitely evaluate Cray Graph Engine as a possible solution.
Check out this video to learn more about the Cray Graph Engine.
The post How CGE Achieves High Performance and Scalability appeared first on Cray Blog.