3.2.2-SNAPSHOT

Recipes

All programming languages tend to have patterns of usage for commonly occurring problems, and Gremlin is no different in that respect. There are many commonly occurring traversal themes that have general applicability to any graph. Gremlin Recipes present these common traversal patterns and methods of usage, which provide basic building blocks for virtually any graph in any domain.

Recipes assume general familiarity with Gremlin and the TinkerPop stack. Be sure to have read the Getting Started tutorial and The Gremlin Console tutorial.

Traversal Recipes

Between Vertices

It is quite common to have a situation where there are two particular vertices of a graph and a need to execute some traversal on the paths found between them. Consider the following examples:

gremlin> g.V(1).bothE() //(1)
==>e[9][1-created->3]
==>e[7][1-knows->2]
==>e[8][1-knows->4]
gremlin> g.V(1).bothE().where(otherV().hasId(2)) //(2)
==>e[7][1-knows->2]
gremlin> v1 = g.V(1).next();[]
gremlin> v2 = g.V(2).next();[]
gremlin> g.V(v1).bothE().where(otherV().is(v2)) //(3)
==>e[7][1-knows->2]
gremlin> g.V(v1).outE().where(inV().is(v2)) //(4)
==>e[7][1-knows->2]
gremlin> g.V(1).outE().where(inV().has(id, within(2,3))) //(5)
==>e[9][1-created->3]
==>e[7][1-knows->2]
gremlin> g.V(1).out().where(__.in().hasId(6)) //(6)
==>v[3]
  1. There are three edges from the vertex with the identifier of "1".

  2. Filter those three edges with the where() step, using the identifier of the vertex returned by otherV() to ensure that it matches the vertex of concern, which is the one with an identifier of "2".

  3. Note that the same traversal will work if there are actual Vertex instances rather than just vertex identifiers.

  4. The vertex with identifier "1" has only outgoing edges, so it would also be acceptable to use the directional steps of outE() and inV() since the schema allows it.

  5. There is also no problem with filtering the terminating side of the traversal on multiple vertices, in this case, vertices with identifiers "2" and "3".

  6. There’s no reason why the same pattern of exclusion used for edges with where() can’t work for a vertex between two vertices.

The basic pattern of using the where() step to find the "other" known vertex can be applied in far more complex scenarios. For one such example, consider the following traversal that finds all the paths between a group of defined vertices:

gremlin> ids = [2,4,6].toArray()
==>2
==>4
==>6
gremlin> g.V(ids).as("a").
           repeat(bothE().otherV().simplePath()).times(5).emit(hasId(within(ids))).as("b").
           filter(select(last,"a","b").by(id).where("a", lt("b"))).
           path().by().by(label)
==>[v[2],knows,v[1],knows,v[4]]
==>[v[2],knows,v[1],created,v[3],created,v[4]]
==>[v[2],knows,v[1],created,v[3],created,v[6]]
==>[v[2],knows,v[1],knows,v[4],created,v[3],created,v[6]]
==>[v[4],created,v[3],created,v[6]]
==>[v[4],knows,v[1],created,v[3],created,v[6]]

For another example, consider the following schema:

recipe-job-schema

Assume that the goal is to find information about a known job and a known person. Specifically, the idea would be to extract the known job, the company that created the job, the date it was created by the company and whether or not the known person completed an application.

gremlin> vBob = graph.addVertex(label, "person", "name", "bob")
==>v[0]
gremlin> vStephen = graph.addVertex(label, "person", "name", "stephen")
==>v[2]
gremlin> vBlueprintsInc = graph.addVertex(label, "company", "name", "Blueprints, Inc")
==>v[4]
gremlin> vRexsterLlc = graph.addVertex(label, "company", "name", "Rexster, LLC")
==>v[6]
gremlin> vBlueprintsJob1 = graph.addVertex(label, "job", "name", "job1")
==>v[8]
gremlin> vBlueprintsJob2 = graph.addVertex(label, "job", "name", "job2")
==>v[10]
gremlin> vBlueprintsJob3 = graph.addVertex(label, "job", "name", "job3")
==>v[12]
gremlin> vRexsterJob1 = graph.addVertex(label, "job", "name", "job4")
==>v[14]
gremlin> vAppBob1 = graph.addVertex(label, "application", "name", "application1")
==>v[16]
gremlin> vAppBob2 = graph.addVertex(label, "application", "name", "application2")
==>v[18]
gremlin> vAppStephen1 = graph.addVertex(label, "application", "name", "application3")
==>v[20]
gremlin> vAppStephen2 = graph.addVertex(label, "application", "name", "application4")
==>v[22]
gremlin> vBob.addEdge("completes", vAppBob1)
==>e[24][0-completes->16]
gremlin> vBob.addEdge("completes", vAppBob2)
==>e[25][0-completes->18]
gremlin> vStephen.addEdge("completes", vAppStephen1)
==>e[26][2-completes->20]
gremlin> vStephen.addEdge("completes", vAppStephen2)
==>e[27][2-completes->22]
gremlin> vAppBob1.addEdge("appliesTo", vBlueprintsJob1)
==>e[28][16-appliesTo->8]
gremlin> vAppBob2.addEdge("appliesTo", vBlueprintsJob2)
==>e[29][18-appliesTo->10]
gremlin> vAppStephen1.addEdge("appliesTo", vRexsterJob1)
==>e[30][20-appliesTo->14]
gremlin> vAppStephen2.addEdge("appliesTo", vBlueprintsJob3)
==>e[31][22-appliesTo->12]
gremlin> vBlueprintsInc.addEdge("created", vBlueprintsJob1, "creationDate", "12/20/2015")
==>e[32][4-created->8]
gremlin> vBlueprintsInc.addEdge("created", vBlueprintsJob2, "creationDate", "12/15/2015")
==>e[33][4-created->10]
gremlin> vBlueprintsInc.addEdge("created", vBlueprintsJob3, "creationDate", "12/16/2015")
==>e[34][4-created->12]
gremlin> vRexsterLlc.addEdge("created", vRexsterJob1, "creationDate", "12/18/2015")
==>e[35][6-created->14]
gremlin> g.V(vRexsterJob1).as('job').
           inE('created').as('created').
           outV().as('company').
           select('job').
           coalesce(__.in('appliesTo').where(__.in('completes').is(vStephen)),
                    constant(false)).as('application').
           select('job', 'company', 'created', 'application').
             by().by().by('creationDate').by()
==>[job:v[14],company:v[6],created:12/18/2015,application:v[20]]
gremlin> g.V(vRexsterJob1, vBlueprintsJob1).as('job').
           inE('created').as('created').
           outV().as('company').
           select('job').
           coalesce(__.in('appliesTo').where(__.in('completes').is(vBob)),
                    constant(false)).as('application').
           select('job', 'company', 'created', 'application').
             by().by().by('creationDate').by()
==>[job:v[14],company:v[6],created:12/18/2015,application:false]
==>[job:v[8],company:v[4],created:12/20/2015,application:v[16]]

While the traversals above are more complex, the pattern for finding "things" between two vertices is largely the same. Note the use of the where() step to terminate the traversers for a specific user. It is embedded in a coalesce() step to handle situations where the specified user did not complete an application for the specified job and will return false in those cases.

Shortest Path

shortest-path

When working with a graph, it is often necessary to identify the shortest path between two identified vertices. The following is a simple example that identifies the shortest path between vertex "1" and vertex "5" while traversing over out edges:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> v1 = graph.addVertex(T.id, 1)
==>v[1]
gremlin> v2 = graph.addVertex(T.id, 2)
==>v[2]
gremlin> v3 = graph.addVertex(T.id, 3)
==>v[3]
gremlin> v4 = graph.addVertex(T.id, 4)
==>v[4]
gremlin> v5 = graph.addVertex(T.id, 5)
==>v[5]
gremlin> v1.addEdge("knows", v2)
==>e[0][1-knows->2]
gremlin> v2.addEdge("knows", v4)
==>e[1][2-knows->4]
gremlin> v4.addEdge("knows", v5)
==>e[2][4-knows->5]
gremlin> v2.addEdge("knows", v3)
==>e[3][2-knows->3]
gremlin> v3.addEdge("knows", v4)
==>e[4][3-knows->4]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:5 edges:5], standard]
gremlin> g.V(1).repeat(out().simplePath()).until(hasId(5)).path().limit(1) //(1)
==>[v[1],v[2],v[4],v[5]]
gremlin> g.V(1).repeat(out().simplePath()).until(hasId(5)).path().count(local) //(2)
==>4
==>5
gremlin> g.V(1).repeat(out().simplePath()).until(hasId(5)).path().
           group().by(count(local)).next() //(3)
==>4=[[v[1], v[2], v[4], v[5]]]
==>5=[[v[1], v[2], v[3], v[4], v[5]]]
  1. The traversal starts at the vertex with the identifier of "1" and repeatedly traverses on out edges "until" it finds a vertex with an identifier of "5". The simplePath() step inside the repeat() filters out paths that revisit a vertex. The traversal terminates with limit(1) in this case as the first path returned will be the shortest one. Of course, it is possible for there to be more than one path in the graph of the same length (i.e. two or more paths of length three), but this example is not considering that.

  2. It might be interesting to know the path lengths for all paths between vertex "1" and "5".

  3. Alternatively, one might wish to do a path length distribution over all the paths.
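For comparison, the hop-count search described above can be sketched outside of Gremlin in plain Java. This is only an illustrative sketch: the class ShortestPathSketch, its adjacency map and the shortestPath method are hypothetical names and not part of TinkerPop. A breadth-first search explores vertices in order of hop distance, so the first path that reaches the target is a shortest one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class ShortestPathSketch {

    // Hypothetical adjacency map mirroring the out edges of the example graph:
    // 1->2, 2->4, 4->5, 2->3, 3->4.
    static Map<Integer, List<Integer>> exampleGraph() {
        Map<Integer, List<Integer>> out = new HashMap<>();
        out.put(1, List.of(2));
        out.put(2, List.of(4, 3));
        out.put(3, List.of(4));
        out.put(4, List.of(5));
        out.put(5, List.of());
        return out;
    }

    // Breadth-first search over out edges. Returns the vertices of one
    // shortest path, or an empty list when the target is unreachable.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> out, int source, int target) {
        Map<Integer, Integer> parent = new HashMap<>();
        Queue<Integer> queue = new ArrayDeque<>();
        parent.put(source, source);
        queue.add(source);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            if (current == target) {
                // Walk the parent links back to the source, then reverse.
                List<Integer> path = new ArrayList<>();
                for (int v = target; v != source; v = parent.get(v)) {
                    path.add(v);
                }
                path.add(source);
                Collections.reverse(path);
                return path;
            }
            for (int next : out.getOrDefault(current, List.of())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(exampleGraph(), 1, 5)); // [1, 2, 4, 5]
    }
}
```

The sketch returns a single path only. Like the limit(1) traversal, it does not report other paths of the same length.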

The previous example defines the length of the path by the number of vertices in the path, but the "path" might also be measured by data within the graph itself. The following example uses the same graph structure as the previous example, but includes a "weight" on the edges that will be used to help determine the "cost" of a particular path:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> v1 = graph.addVertex(T.id, 1)
==>v[1]
gremlin> v2 = graph.addVertex(T.id, 2)
==>v[2]
gremlin> v3 = graph.addVertex(T.id, 3)
==>v[3]
gremlin> v4 = graph.addVertex(T.id, 4)
==>v[4]
gremlin> v5 = graph.addVertex(T.id, 5)
==>v[5]
gremlin> v1.addEdge("knows", v2, "weight", 1.25)
==>e[0][1-knows->2]
gremlin> v2.addEdge("knows", v4, "weight", 1.5)
==>e[1][2-knows->4]
gremlin> v4.addEdge("knows", v5, "weight", 0.25)
==>e[2][4-knows->5]
gremlin> v2.addEdge("knows", v3, "weight", 0.25)
==>e[3][2-knows->3]
gremlin> v3.addEdge("knows", v4, "weight", 0.25)
==>e[4][3-knows->4]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:5 edges:5], standard]
gremlin> g.V(1).repeat(out().simplePath()).until(hasId(5)).path().
           group().by(count(local)).next() //(1)
==>4=[[v[1], v[2], v[4], v[5]]]
==>5=[[v[1], v[2], v[3], v[4], v[5]]]
gremlin> g.V(1).repeat(outE().inV().simplePath()).until(hasId(5)).
           path().by(coalesce(values('weight'),
                              constant(0.0))).
           map(unfold().sum()) //(2)
==>3.00
==>2.00
gremlin> g.V(1).repeat(outE().inV().simplePath()).until(hasId(5)).
           path().by(constant(0.0)).by('weight').map(unfold().sum()) //(3)
==>3.00
==>2.00
gremlin> g.V(1).repeat(outE().inV().simplePath()).until(hasId(5)).
           path().as('p').
           map(unfold().coalesce(values('weight'),
                                 constant(0.0)).sum()).as('cost').
           select('cost','p') //(4)
==>[cost:3.00,p:[v[1],e[0][1-knows->2],v[2],e[1][2-knows->4],v[4],e[2][4-knows->5],v[5]]]
==>[cost:2.00,p:[v[1],e[0][1-knows->2],v[2],e[3][2-knows->3],v[3],e[4][3-knows->4],v[4],e[2][4-knows->5],v[5]]]
  1. Note that the shortest path as determined by the structure of the graph is the same.

  2. Calculate the "cost" of the path as determined by the weight on the edges. As the "weight" data is on the edges between the vertices, it is necessary to change the contents of the repeat step to use outE().inV() so that the edge is included in the path. The path is then post-processed with a by modulator that extracts the "weight" value. The traversal uses coalesce as there is a mixture of vertices and edges in the path and the traversal is only interested in edge elements that can return a "weight" property. The final part of the traversal executes a map function over each path, unfolding it and summing the weights.

  3. The same traversal as the one above it, but avoids the use of coalesce with the use of two by modulators. The by modulator is applied in a round-robin fashion, so the first by will always apply to a vertex (as it is the first item in every path) and the second by will always apply to an edge (as it always follows the vertex in the path).

  4. The output of the previous examples of the "cost" wasn’t terribly useful as it didn’t include which path had the calculated cost. With some slight modifications given the use of select it becomes possible to include the path in the output. Note that the path with the lowest "cost" actually has a longer path length as determined by the graph structure.
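The cost calculation can also be sketched in plain Java to make the arithmetic explicit. The sketch below is an assumption-laden illustration, not TinkerPop code: the class WeightedPathSketch and its methods are hypothetical, and the graph is a hand-built copy of the weighted example. It enumerates every simple path from the source to the target and sums the edge weights along each one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WeightedPathSketch {

    static void addEdge(Map<Integer, Map<Integer, Double>> out, int from, int to, double weight) {
        out.computeIfAbsent(from, k -> new LinkedHashMap<>()).put(to, weight);
    }

    // Hypothetical weighted adjacency map mirroring the example graph.
    static Map<Integer, Map<Integer, Double>> exampleGraph() {
        Map<Integer, Map<Integer, Double>> out = new LinkedHashMap<>();
        addEdge(out, 1, 2, 1.25);
        addEdge(out, 2, 4, 1.5);
        addEdge(out, 4, 5, 0.25);
        addEdge(out, 2, 3, 0.25);
        addEdge(out, 3, 4, 0.25);
        return out;
    }

    // Maps every simple path from source to target to the sum of its edge weights.
    static Map<List<Integer>, Double> pathCosts(Map<Integer, Map<Integer, Double>> out,
                                                int source, int target) {
        Map<List<Integer>, Double> costs = new LinkedHashMap<>();
        List<Integer> path = new ArrayList<>();
        path.add(source);
        walk(out, source, target, path, 0.0, costs);
        return costs;
    }

    private static void walk(Map<Integer, Map<Integer, Double>> out, int current, int target,
                             List<Integer> path, double cost, Map<List<Integer>, Double> costs) {
        if (current == target) {
            costs.put(new ArrayList<>(path), cost);
            return;
        }
        for (Map.Entry<Integer, Double> edge : out.getOrDefault(current, Map.of()).entrySet()) {
            int next = edge.getKey();
            if (path.contains(next)) {
                continue; // keep the path simple: never revisit a vertex
            }
            path.add(next);
            walk(out, next, target, path, cost + edge.getValue(), costs);
            path.remove(path.size() - 1); // backtrack
        }
    }

    // The path with the lowest total weight.
    static List<Integer> cheapestPath(Map<List<Integer>, Double> costs) {
        return Collections.min(costs.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<List<Integer>, Double> costs = pathCosts(exampleGraph(), 1, 5);
        costs.forEach((path, cost) -> System.out.println(path + " -> " + cost));
        System.out.println("cheapest: " + cheapestPath(costs)); // cheapest: [1, 2, 3, 4, 5]
    }
}
```

As in the Gremlin output, the five-vertex path costs 2.0 while the four-vertex path costs 3.0, so the cheapest path is not the one with the fewest hops.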

If-Then Based Grouping

Consider the following traversal over the "modern" toy graph:

gremlin> g.V().hasLabel('person').groupCount().by('age')
==>[32:1,35:1,27:1,29:1]

The result is an age distribution that simply shows that every "person" in the graph is of a different age. In some cases, this result is exactly what is needed, but sometimes a grouping may need to be transformed to provide a different picture of the result. For example, perhaps a grouping on the value "age" would be better represented by a domain concept such as "young", "old" and "very old".

gremlin> g.V().hasLabel("person").groupCount().by(values("age").choose(
           is(lt(28)),constant("young"),
           choose(is(lt(30)),
                  constant("old"),
                  constant("very old"))))
==>[young:1,old:1,very old:2]

Note that the by modulator has been altered from simply taking a string key of "age" to take a Traversal. That inner Traversal utilizes choose which is like an if-then-else clause. The choose is nested and would look like the following in Java:

if (age < 28) {
  return "young";
} else {
  if (age < 30) {
    return "old";
  } else {
    return "very old";
  }
}

The use of choose is a good intuitive choice for this Traversal as it is a natural mapping to if-then-else, but there is another option to consider with coalesce:

gremlin> g.V().hasLabel("person").
           groupCount().by(values("age").
           coalesce(is(lt(28)).constant("young"),
                    is(lt(30)).constant("old"),
                    constant("very old")))
==>[young:1,old:1,very old:2]

The answer is the same, but this traversal removes the nested choose, which makes it easier to read.
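In the spirit of the Java snippet above, the flattened coalesce form roughly corresponds to a chain of early returns, where the first matching condition wins. The following is a minimal sketch in plain Java: AgeBucketSketch, bucket and groupCount are hypothetical names, and the ages are copied from the groupCount() output shown earlier.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class AgeBucketSketch {

    // The first matching condition wins, like the ordered arguments of coalesce().
    static String bucket(int age) {
        if (age < 28) return "young";
        if (age < 30) return "old";
        return "very old";
    }

    // Counts how many ages fall into each bucket (keys sorted alphabetically).
    static Map<String, Long> groupCount(List<Integer> ages) {
        return ages.stream()
                   .collect(Collectors.groupingBy(AgeBucketSketch::bucket,
                                                  TreeMap::new,
                                                  Collectors.counting()));
    }

    public static void main(String[] args) {
        // Ages of the four "person" vertices in the "modern" toy graph.
        System.out.println(groupCount(List.of(29, 27, 32, 35))); // {old=1, very old=2, young=1}
    }
}
```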

Cycle Detection

A cycle occurs in a graph where a path loops back on itself to the originating vertex. For example, in the graph depicted below Gremlin could be used to detect the cycle among vertices A-B-C.

graph-cycle

gremlin> vA = graph.addVertex(id, 'a')
==>v[a]
gremlin> vB = graph.addVertex(id, 'b')
==>v[b]
gremlin> vC = graph.addVertex(id, 'c')
==>v[c]
gremlin> vD = graph.addVertex(id, 'd')
==>v[d]
gremlin> vA.addEdge("knows", vB)
==>e[0][a-knows->b]
gremlin> vB.addEdge("knows", vC)
==>e[1][b-knows->c]
gremlin> vC.addEdge("knows", vA)
==>e[2][c-knows->a]
gremlin> vA.addEdge("knows", vD)
==>e[3][a-knows->d]
gremlin> vC.addEdge("knows", vD)
==>e[4][c-knows->d]
gremlin> g.V().as("a").repeat(out().simplePath()).times(2).
           where(out().as("a")).path() //(1)
==>[v[a],v[b],v[c]]
==>[v[b],v[c],v[a]]
==>[v[c],v[a],v[b]]
gremlin> g.V().as("a").repeat(out().simplePath()).times(2).
           where(out().as("a")).path().
           dedup().by(unfold().order().by(id).dedup().fold()) //(2)
==>[v[a],v[b],v[c]]
  1. Gremlin starts its traversal from every vertex, labeling each starting vertex "a", and traverses out() from each one, filtering on simplePath(), which removes paths with repeated objects. The out() steps are repeated twice because in this case the length of the cycle is known to be three and there is no need to exceed that. The traversal then filters with where() to return only those paths whose last vertex has an outgoing edge back to the starting vertex "a".

  2. The previous query returned the A-B-C cycle, but it returned three paths which were all technically the same cycle. It returned three because there was one for each vertex that started the cycle (i.e. one for A, one for B and one for C). This next line introduces deduplication to return only unique cycles.
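The deduplication idea, treating cycles with the same set of vertices as one cycle, can be sketched in plain Java as follows. This is an illustrative sketch under stated assumptions: CycleSketch and uniqueTriangles are hypothetical names, the graph is a hand-built copy of the example, and only cycles of exactly three vertices are considered. Each cycle is reduced to its sorted vertex list, which plays the role of the key built by unfold().order().by(id).dedup().fold().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CycleSketch {

    // Hypothetical adjacency map of out edges: a->b, b->c, c->a, a->d, c->d.
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("a", List.of("b", "d"));
        out.put("b", List.of("c"));
        out.put("c", List.of("a", "d"));
        out.put("d", List.of());
        return out;
    }

    // Finds directed cycles of exactly three vertices and keeps one per vertex set.
    static List<List<String>> uniqueTriangles(Map<String, List<String>> out) {
        Set<List<String>> seen = new HashSet<>();
        List<List<String>> cycles = new ArrayList<>();
        for (String a : out.keySet()) {
            for (String b : out.getOrDefault(a, List.of())) {
                if (b.equals(a)) continue;
                for (String c : out.getOrDefault(b, List.of())) {
                    if (c.equals(a) || c.equals(b)) continue;           // simple path only
                    if (!out.getOrDefault(c, List.of()).contains(a)) continue; // must close the loop
                    List<String> key = new ArrayList<>(List.of(a, b, c));
                    Collections.sort(key);                               // order-independent key
                    if (seen.add(key)) {
                        cycles.add(List.of(a, b, c));
                    }
                }
            }
        }
        return cycles;
    }

    public static void main(String[] args) {
        System.out.println(uniqueTriangles(exampleGraph())); // [[a, b, c]]
    }
}
```

Note that a key based only on the vertex set also treats two cycles over the same vertices in opposite directions as duplicates.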

The above case assumed that the need was to only detect cycles over a path length of three. It also respected the directionality of the edges by only considering outgoing ones. What would need to change to detect cycles of arbitrary length over both incoming and outgoing edges in the modern graph?

gremlin> g.V().as("a").repeat(both().simplePath()).emit(loops().is(gt(1))).
           both().where(eq("a")).path().
           dedup().by(unfold().order().by(id).dedup().fold())
==>[v[1],v[3],v[4],v[1]]

Centrality

There are many measures of centrality which are meant to help identify the most important vertices in a graph. As these measures are common in graph theory, this section attempts to demonstrate how some of these different indicators can be calculated using Gremlin.

Degree Centrality

Degree centrality is a measure of the number of edges associated to each vertex.

gremlin> g.V().group().by().by(bothE().count()) //(1)
==>[v[1]:3,v[2]:1,v[3]:3,v[4]:3,v[5]:1,v[6]:1]
gremlin> g.V().group().by().by(inE().count()) //(2)
==>[v[1]:0,v[2]:1,v[3]:3,v[4]:1,v[5]:1,v[6]:0]
gremlin> g.V().group().by().by(outE().count()) //(3)
==>[v[1]:3,v[2]:0,v[3]:0,v[4]:2,v[5]:0,v[6]:1]
gremlin> g.V().project("v","degree").by().by(bothE().count()) //(4)
==>[v:v[1],degree:3]
==>[v:v[2],degree:1]
==>[v:v[3],degree:3]
==>[v:v[4],degree:3]
==>[v:v[5],degree:1]
==>[v:v[6],degree:1]
gremlin> g.V().project("v","degree").by().by(bothE().count()). //(5)
           order().by(select("degree"), decr).
           limit(4)
==>[v:v[1],degree:3]
==>[v:v[3],degree:3]
==>[v:v[4],degree:3]
==>[v:v[2],degree:1]
  1. Calculation of degree centrality which counts all incident edges on each vertex to include those that are both incoming and outgoing.

  2. Calculation of in-degree centrality which only counts incoming edges to a vertex.

  3. Calculation of out-degree centrality which only counts outgoing edges from a vertex.

  4. The previous examples all produce a single Map as their output. While that is a desirable output, producing a stream of Map objects can allow some greater flexibility.

  5. For example, use of a stream enables use of an ordered limit that can be executed in a distributed fashion in OLAP traversals.

Note
The group step takes up to two separate by modulators. The first by() tells group() what the key in the resulting Map will be (i.e. the value to group on). In the above examples, the by() is empty and as a result, the grouping will be on the incoming Vertex object itself. The second by() is the value to be stored in the Map for each key.
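To make the counting concrete, the three degree measures can be sketched in plain Java over an edge list. This is only a sketch: DegreeSketch, degree and top are hypothetical names, and EDGES is assumed to be the edge list of the "modern" toy graph written as (out vertex, in vertex) pairs.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class DegreeSketch {

    // Assumed edges of the "modern" toy graph as {outVertex, inVertex} pairs.
    static final int[][] EDGES = {
        {1, 2}, {1, 4}, {1, 3}, {4, 5}, {4, 3}, {6, 3}
    };

    // Counts outgoing and/or incoming edges per vertex, keyed by vertex id.
    static Map<Integer, Integer> degree(int[][] edges, boolean countOut, boolean countIn) {
        Map<Integer, Integer> degree = new TreeMap<>();
        for (int[] e : edges) {
            degree.putIfAbsent(e[0], 0); // make sure zero-degree entries appear
            degree.putIfAbsent(e[1], 0);
            if (countOut) degree.merge(e[0], 1, Integer::sum);
            if (countIn) degree.merge(e[1], 1, Integer::sum);
        }
        return degree;
    }

    // The n entries with the highest degree; ties keep ascending vertex id order
    // because the sort is stable over the id-ordered TreeMap entries.
    static List<Map.Entry<Integer, Integer>> top(Map<Integer, Integer> degree, int n) {
        return degree.entrySet().stream()
                     .sorted(Map.Entry.<Integer, Integer>comparingByValue().reversed())
                     .limit(n)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(degree(EDGES, true, true));   // {1=3, 2=1, 3=3, 4=3, 5=1, 6=1}
        System.out.println(degree(EDGES, false, true));  // {1=0, 2=1, 3=3, 4=1, 5=1, 6=0}
        System.out.println(degree(EDGES, true, false));  // {1=3, 2=0, 3=0, 4=2, 5=0, 6=1}
        System.out.println(top(degree(EDGES, true, true), 4)); // [1=3, 3=3, 4=3, 2=1]
    }
}
```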

Betweenness Centrality

Betweenness centrality is a measure of the number of times a vertex is found on the shortest path between each pair of vertices in a graph. Consider the following graph for demonstration purposes:

betweeness-example

gremlin> a = graph.addVertex('name','a')
==>v[0]
gremlin> b = graph.addVertex('name','b')
==>v[2]
gremlin> c = graph.addVertex('name','c')
==>v[4]
gremlin> d = graph.addVertex('name','d')
==>v[6]
gremlin> e = graph.addVertex('name','e')
==>v[8]
gremlin> a.addEdge('next',b)
==>e[10][0-next->2]
gremlin> b.addEdge('next',c)
==>e[11][2-next->4]
gremlin> c.addEdge('next',d)
==>e[12][4-next->6]
gremlin> d.addEdge('next',e)
==>e[13][6-next->8]
gremlin> g.withSack(0).V().store("x").repeat(both().simplePath()).emit().path(). //(1)
           group().by(project("a","b").by(limit(local, 1)). //(2)
                                       by(tail(local, 1))).
                   by(order().by(count(local))). //(3)
                   select(values).as("shortestPaths"). //(4)
                   select("x").unfold().as("v"). //(5)
                   select("shortestPaths"). //(6)
                     map(unfold().filter(unfold().where(eq("v"))).count()). //(7)
                     sack(sum).sack().as("betweeness"). //(8)
                   select("v","betweeness")
==>[v:v[0],betweeness:8]
==>[v:v[2],betweeness:14]
==>[v:v[4],betweeness:16]
==>[v:v[6],betweeness:14]
==>[v:v[8],betweeness:8]
  1. Defines a Gremlin sack with a value of zero, which represents the initial betweenness score for each vertex, and traverses on both incoming and outgoing edges avoiding cyclic paths.

  2. Group each path by the first and last vertex.

  3. Reduce the list of paths to the shortest path between the first and last vertex by ordering on their lengths.

  4. Recall that at this point, there is a Map keyed by first and last vertex and with a value of just the shortest path. Extract the shortest path with select(values), since that’s the only portion required for the remainder of the traversal.

  5. The "x" key contains the list of vertices stored from step 1 - unfold that list into "v" for later use. This step will unwrap the vertex that is stored in the Traverser as a BulkSet so that it can be used directly in the Traversal.

  6. Iterate the set of shortest paths. At this point, it is worth noting that the traversal is iterating each vertex in "v" and for each vertex in "v" it is iterating each Path in "shortestPaths".

  7. For each path, transform it to a count of the number of times that "v" from step 5 is encountered.

  8. Sum the counts for each vertex using sack(), normalize the value and label it as the "betweeness" to be the score.

Closeness Centrality

Closeness centrality is a measure of the distance of one vertex to all other reachable vertices in the graph.

gremlin> g.withSack(1f).V().repeat(both().simplePath()).emit().path(). //(1)
           group().by(project("a","b").by(limit(local, 1)). //(2)
                                       by(tail(local, 1))).
                   by(order().by(count(local))). //(3)
           select(values).unfold(). //(4)
           project("v","length").
             by(limit(local, 1)). //(5)
             by(count(local).sack(div).sack()). //(6)
           group().by(select("v")).by(select("length").sum()) //(7)
==>[v[1]:2.1666666666666665,v[2]:1.6666666666666665,v[3]:2.1666666666666665,v[4]:2.1666666666666665,v[5]:1.6666666666666665,v[6]:1.6666666666666665]
  1. Defines a Gremlin sack with a value of one, and traverses on both incoming and outgoing edges avoiding cyclic paths.

  2. Group each path by the first and last vertex.

  3. Reduce the list of paths to the shortest path between the first and last vertex by ordering on their lengths.

  4. Recall that at this point, there is a Map keyed by first and last vertex and with a value of just the shortest path. Extract the shortest path with select(values), since that’s the only portion required for the remainder of the traversal.

  5. The first by() modulator for project() extracts the first vertex in the path.

  6. The second by() modulator for project() extracts the path length (counted in vertices) and, with sack(div), divides the value of the sack(), which was initialized to 1 at the start of the traversal, by that length.

  7. Group the resulting Map objects on "v" and sum their lengths to get the centrality score for each.
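The score produced by the steps above can be cross-checked with a short plain Java sketch. It rests on a few assumptions: ClosenessSketch and its methods are hypothetical names, EDGES is assumed to be the edge list of the "modern" toy graph, and the score is taken to be the sum of 1 divided by the number of vertices on the shortest path to every other reachable vertex, which is what the output above suggests.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

class ClosenessSketch {

    // Assumed edges of the "modern" toy graph as {outVertex, inVertex} pairs.
    static final int[][] EDGES = {
        {1, 2}, {1, 4}, {1, 3}, {4, 5}, {4, 3}, {6, 3}
    };

    // Builds an undirected adjacency map, mirroring the use of both().
    static Map<Integer, List<Integer>> undirected(int[][] edges) {
        Map<Integer, List<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        return adj;
    }

    // Sums 1 / (vertices on the shortest path) over all other reachable vertices.
    static double closeness(Map<Integer, List<Integer>> adj, int source) {
        Map<Integer, Integer> hops = new HashMap<>();
        Queue<Integer> queue = new ArrayDeque<>();
        hops.put(source, 0);
        queue.add(source);
        double score = 0.0;
        while (!queue.isEmpty()) {
            int current = queue.remove();
            for (int next : adj.getOrDefault(current, List.of())) {
                if (!hops.containsKey(next)) {
                    int h = hops.get(current) + 1;
                    hops.put(next, h);
                    queue.add(next);
                    score += 1.0 / (h + 1); // a path of h edges holds h + 1 vertices
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = undirected(EDGES);
        for (int v : adj.keySet()) {
            System.out.println("v[" + v + "] " + closeness(adj, v));
        }
    }
}
```

For vertex 1, three neighbors contribute 1/2 each and two vertices at two hops contribute 1/3 each, which gives roughly 2.1667 and agrees with the Gremlin result.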

Eigenvector Centrality

A calculation of eigenvector centrality uses the relative importance of adjacent vertices to help determine their centrality. In other words, unlike with degree centrality, the vertex with the greatest number of incident edges does not necessarily have the highest rank. Consider the following example using the Grateful Dead graph:

gremlin> graph.io(graphml()).readGraph('data/grateful-dead.xml')
==>null
gremlin> g.V().repeat(groupCount('m').by('name').out()).times(5).cap('m'). //(1)
           order(local).by(values, decr).limit(local, 10).next() //(2)
==>PLAYING IN THE BAND=8758598
==>ME AND MY UNCLE=8214246
==>JACK STRAW=8173882
==>EL PASO=7666994
==>TRUCKING=7643494
==>PROMISED LAND=7339027
==>CHINA CAT SUNFLOWER=7322213
==>CUMBERLAND BLUES=6730838
==>RAMBLE ON ROSE=6676667
==>LOOKS LIKE RAIN=6674121
gremlin> g.V().repeat(groupCount('m').by('name').out().timeLimit(100)).times(5).cap('m'). //(3)
           order(local).by(values, decr).limit(local, 10).next()
==>PLAYING IN THE BAND=8758598
==>ME AND MY UNCLE=8214246
==>JACK STRAW=8173882
==>EL PASO=7666994
==>TRUCKING=7643494
==>PROMISED LAND=7339027
==>CHINA CAT SUNFLOWER=7322213
==>CUMBERLAND BLUES=6730838
==>RAMBLE ON ROSE=6676667
==>LOOKS LIKE RAIN=6674121
  1. The traversal iterates through each vertex in the graph and for each one repeatedly group counts each vertex that passes through, using the vertex's "name" as the key. The Map of this group count is stored in a variable named "m". The out() traversal is repeated five times or until the paths are exhausted. Five iterations should provide enough time to converge on a solution. Calling cap('m') at the end simply extracts the Map side-effect stored in "m".

  2. The entries in the Map are then iterated and sorted with the top ten most central vertices presented as output.

  3. The previous examples can be expanded on a little bit by including a time limit. The timeLimit() prevents the traversal from taking longer than one hundred milliseconds to execute (the previous example takes considerably longer than that). While the answer provided with the timeLimit() is not the absolute ranking, it does provide a relative ranking that closely matches the absolute one. The use of timeLimit() in certain algorithms (e.g. recommendations) can shorten the time required to get a reasonable and usable result.

Traversal Induced Values

The parameters of a Traversal can be known ahead of time as constants or might otherwise be passed in as dynamic arguments.

gremlin> g.V().has('name','marko').out('knows').has('age', gt(29)).values('name')
==>josh

In plain language, the above Gremlin asks, "What are the names of the people who Marko knows who are over the age of 29?". In this case, "29" is known as a constant to the traversal. Of course, if the question is changed slightly to instead ask, "What are the names of the people who Marko knows who are older than he is?", the hardcoding of "29" will no longer suffice. There are multiple ways Gremlin would allow this second question to be answered. The first is obvious to any programmer - use a variable:

gremlin> marko = g.V().has('name','marko').next()
==>v[1]
gremlin> g.V(marko).out('knows').has('age', gt(marko.value('age'))).values('name')
==>josh

The downside to this approach is that it takes two separate traversals to answer the question. Ideally, there should be a single traversal, that can query "marko" once, determine his age and then use that for the value supplied to filter the people he knows. In this way the value for the age filter is induced from the Traversal itself.

gremlin> g.V().has('name','marko').as('marko'). //(1)
           out('knows').as('friend'). //(2)
           filter(select('marko','friend').by('age'). //(3)
                  where('friend', gt('marko'))). //(4)
           values('name')
==>josh
  1. Find the "marko" Vertex and label it as "marko".

  2. Traverse out on the "knows" edges to the adjacent Vertex and label it as "friend".

  3. Filter the incoming "friend" vertices. It is within this filter that the traversal induced values are utilized. The inner select grabs the "marko" vertex and the current "friend". The by modulator extracts the "age" from both of those vertices, which yields a Map with two keys, "marko" and "friend", where the value of each is the "age".

  4. The Map produced in the previous step can then be filtered with where to only return a result if the "friend" age is greater than the "marko" age. If this is successful, then the filter step from the previous line will succeed and allow the "friend" vertex to pass through.

This traversal could also be written declaratively with the match step as follows:

gremlin> g.V().has('name','marko').match(
             __.as('marko').values('age').as('a'),
             __.as('marko').out('knows').as('friend'),
             __.as('friend').values('age').as('b')
           ).where('b', gt('a')).select('friend').
           values('name')
==>josh

Traversal induced values are not just for filtering. They can also be used when writing the values of the properties of one Vertex to another:

gremlin> g.V().has('name', 'marko').as('marko').
           out('created').property('creator', select('marko').by('name'))
==>v[3]
gremlin> g.V().has('name', 'marko').out('created').valueMap()
==>[creator:[marko],name:[lop],lang:[java]]

Implementation Recipes

Style Guide

Gremlin is a data flow language where each new step concatenation alters the stream accordingly. This aspect of the language allows users to easily "build-up" a traversal (literally) step-by-step until the expected results are returned. For instance:

gremlin> g.V(1)
==>v[1]
gremlin> g.V(1).out('knows')
==>v[2]
==>v[4]
gremlin> g.V(1).out('knows').out('created')
==>v[5]
==>v[3]
gremlin> g.V(1).out('knows').out('created').groupCount()
==>[v[3]:1,v[5]:1]
gremlin> g.V(1).out('knows').out('created').groupCount().by('name')
==>[ripple:1,lop:1]

A drawback of building up a traversal is that users tend to create long, single-line traversals that are hard to read. For simple traversals, a single line is fine. For complex traversals, there are a few formatting patterns that should be followed which will yield cleaner, easier to understand traversals. For instance, the last traversal above would be written:

gremlin> g.V(1).out('knows').out('created').
           groupCount().by('name')
==>[ripple:1,lop:1]

Let's look at a complex traversal and analyze each line according to the recommended formatting rule it subscribes to.

gremlin> g.V().out('knows').out('created'). //(1)
           group().by('lang').by(). //(2)
             select('java').unfold(). //(3)
           in('created').hasLabel('person'). //(4)
           order(). //(5)
             by(inE().count(),decr). //(6)
             by('age',incr).
           dedup().limit(10).values('name') //(7)
==>josh
==>marko
==>peter
  1. A sequence of ins().outs().filters().etc() on a single line until it gets too long.

  2. When a barrier (reducer, aggregator, etc.) is used, put it on a new line.

  3. When a next line component is an "add on" to the previous line component, use a 2 space indent. The select()-step in this context is "almost like" a by()-modulator as it's projecting data out of the group(). The unfold()-step is a data formatting necessity that should not be made too prominent.

  4. Back to a series of ins().outs().filters().etc() on a single line.

  5. order() is a barrier step and thus, should be on a new line.

  6. If there is only one by()-modulator (or a series of short ones), keep it on one line, else each by() is a new line.

  7. Back to a series of ins().outs().filters().etc().

Style Guide Rules

A generalization of the specifics above is presented below.

  • Always use 2 space indent.

  • No newline should ever have the same indent as the line starting with the traversal source g.

  • Barrier steps should form line breaks unless they are simple (e.g. sum()).

  • Complex by()-modulators form indented "paragraphs."

  • Standard filters, maps, flatMaps remain on the same line until they get too long.

Given the diversity of traversals and the complexities introduced by lambdas (for example), these rules will not always lead to optimal representations. However, by and large, the style rules above will help make 90% of traversals look great.

Traversal Component Reuse

Good software development practices require reuse to keep software maintainable. In Gremlin, there are often bits of traversal logic that could be represented as components that might be tested independently and utilized as part of other traversals. One approach to doing this would be to extract such logic into an anonymous traversal and provide it to a parent traversal through the flatMap() step.

Using the modern toy graph as an example, assume that there are a number of traversals that are interested in filtering on edges where the "weight" property is greater than "0.5". A query like that might look like this:

gremlin> g.V(1).outE("knows").has('weight', P.gt(0.5d)).inV().both()
==>v[5]
==>v[3]
==>v[1]

Repeatedly requiring that filter on "weight" could lead to a lot of duplicate code, which becomes difficult to maintain. It would be nice to extract that logic so as to centralize it for reuse in all places where needed. An anonymous traversal allows that to happen and can be created as follows.

gremlin> weightFilter = outE("knows").has('weight', P.gt(0.5d)).inV();[]
gremlin> g.V(1).flatMap(weightFilter).both()
==>v[5]
==>v[3]
==>v[1]

The weightFilter is an anonymous traversal and it is created by way of the __ class. The __ is omitted above from the initialization of weightFilter because it is statically imported to the Gremlin Console. The weightFilter gets passed to the "full" traversal by way of the flatMap() step and the results are the same. Of course, there is a problem. If there is an attempt to use that weightFilter a second time, the traversal will throw an exception because both the weightFilter and the parent traversal have been "compiled", which prevents their re-use. A simple fix to this would be to clone the weightFilter.

gremlin> weightFilter = outE("knows").has('weight', P.gt(0.5d)).inV();[]
gremlin> g.V(1).flatMap(weightFilter.clone()).both()
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(1).flatMap(weightFilter.clone()).bothE().otherV()
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(1).flatMap(weightFilter.clone()).groupCount()
==>[v[4]:1]

Now the weightFilter can be reused over and over again. Remembering to clone() might lead to yet another maintenance issue, in that forgetting that step would likely result in a bug. One option might be to wrap the weightFilter creation in a function that returns the clone. Another approach might be to parameterize that function so that it constructs a new anonymous traversal on each call, which also gains the flexibility of parameterizing the anonymous traversal itself.

gremlin> weightFilter = { w -> outE("knows").has('weight', P.gt(w)).inV() }
==>groovysh_evaluate$_run_closure1@12ad1b2a
gremlin> g.V(1).flatMap(weightFilter(0.5d)).both()
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(1).flatMap(weightFilter(0.5d)).bothE().otherV()
==>v[5]
==>v[3]
==>v[1]
gremlin> g.V(1).flatMap(weightFilter(0.5d)).groupCount()
==>[v[4]:1]
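The first option mentioned above, a function that hands out a clone, is not shown in the console session. A minimal sketch of it, assuming the same modern toy graph and using the hypothetical names weightFilterTemplate and newWeightFilter, might look like this:

```groovy
// Assumption: Gremlin Console with g = TinkerFactory.createModern().traversal().
// weightFilterTemplate and newWeightFilter are hypothetical names for this sketch.

// Build the anonymous traversal once. The trailing ;[] keeps the console from
// iterating (and thereby "compiling") the template itself.
weightFilterTemplate = __.outE("knows").has('weight', P.gt(0.5d)).inV();[]

// Each call returns a fresh clone, so callers cannot forget to clone().
newWeightFilter = { weightFilterTemplate.clone() }

g.V(1).flatMap(newWeightFilter()).both()
g.V(1).flatMap(newWeightFilter()).groupCount()
```

Because the filter logic is unchanged, both traversals should return the same results as the clone() examples above. The template is never handed to a parent traversal directly, which is what keeps it reusable.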

How to Contribute a Recipe

Recipes are generated under the same system as all TinkerPop documentation and are stored directly in the source code repository. TinkerPop documentation is all asciidoc based and can be generated locally with either shell script/Maven or Docker build commands. Once changes are complete, submit a pull request for review by TinkerPop committers.

Note
Please review existing recipes and attempt to conform to their writing and visual style. It may also be a good idea to discuss ideas for a recipe on the developer mailing list prior to starting work on it, as the community might provide insight on the approach and idea that would be helpful. It is preferable that a JIRA issue be opened that describes the nature of the recipe so that the eventual pull request can be bound to that issue.

To contribute a recipe, first clone the repository:

git clone https://github.com/apache/tinkerpop.git

The recipes can be found in this directory:

ls docs/src/recipes

Each recipe exists within a separate .asciidoc file. The file name should match the name of the recipe. Recipe names should be short, but descriptive (as they need to fit in the left-hand table of contents when generated). The index.asciidoc is the parent document that "includes" the content of each individual recipe file. A recipe file is included in the index.asciidoc with an entry like this: include::my-recipe.asciidoc[]

Documentation should be generated locally for review prior to submitting a pull request. TinkerPop documentation is "live" in that it is bound to a specific version when generated. Furthermore, code examples (those that are gremlin-groovy based) are executed at document generation time with the results written directly into the output. The following command will generate the documentation:

bin/process-docs.sh

The generated documentation can be found at target/docs/htmlsingle/recipes. This process can be long on the first run of the documentation as it is generating all of the documentation locally (e.g. reference documentation, tutorials, etc). To generate just the recipes, follow this process:

bin/process-docs.sh --dryRun               (1)
rm -r target/postprocess-asciidoc/recipes  (2)
bin/process-docs.sh                        (3)
  1. That command will quickly generate all of the documentation, but it does not do the code example execution (which is the "slow" part).

  2. Delete the recipes directory, which forces a fresh copy of the recipes to be generated.

  3. Process all of the documentation that is "new" (i.e. the fresh copy of recipes).

The bin/process-docs.sh approach requires that Hadoop is installed. To avoid that prerequisite, try using Docker:

docker/build.sh -d

The downside to using Docker is that the process will take longer as each run will require the entire documentation set to be generated.

The final step to submitting a recipe is to issue a pull request through GitHub. It is helpful to prefix the name of the pull request with the JIRA issue number, so that TinkerPop's automation can link the pull request to the JIRA issue. As mentioned earlier in this section, the recipe will go under review by TinkerPop committers prior to merging. This process may take several days to complete. We look forward to receiving your submissions!