How to write a Page Rank application

In this example, we will detail a very simple implementation of the page rank algorithm and how input/output works in Giraph. At the end of this short tutorial, you should have a simple working piece of code that will run on a real cluster.

Choose your graph generic types

Giraph implements bulk synchronous parallel computing model (http://en.wikipedia.org/wiki/Bulk_synchronous_parallel) specifically for graph processing. Graphs are composed of vertices and edges. We have four types that need to be defined by the user. Note that if you intend to use primitive based objects for any of the types, they are all available (i.e. ShortWritable, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable). They are lots of other types available as well (i.e. TextWritable, BytesWritable, MapWritable, etc.). Note that if you are not using a type (i.e. vertex value or edge value), you can set them to NullWritable.

  • Vertex id (type I)
    • Used to identify a vertex
    • Implements WritableComparable
    • Typically a LongWritable to support 64-bit integers
  • Vertex value (type V)
    • Used to store a vertex value
    • Implements Writable
    • Can be anything, a map for label influence, a DoubleWritable for page rank, etc.
  • Edge value (type E)
    • Used to store a value on an outgoing edge
    • Implements Writable
    • Can be anything, a DoubleWritable for page rank weights, NullWritable for unweighted graph, etc.
  • Message value (type M)
    • Used as a container for sending messages to other vertices
    • Implements Writable
    • Can be anything, a DoubleWritable for page rank values, LongWritable for component ids, etc.

Choosing the types for Page Rank

For the vertex id (I), we'll use LongWritable, which allows us to load user ids from the facebook graph. For the vertex value (V), we'll use DoubleWritable, which will be the page rank value. For the edge value (E), we'll use DoubleWritable to represent the edge weights. For the message value (M), we'll use DoubleWritable, which represents the page rank value that a vertex propagates to my neighbors.

Define how to load the graph into Giraph

There are two types of input vertex-centric and edge-centric. Some algorithms need vertex centric data, such as pairs of vertex ids and initial page ranks. Some algorithms only look at edges. For example, connected components can be run without any vertex values. Some have input from multiple sources (i.e. vertex values from vertex-centric input and edges from edge-centric input). You need to think about your input sources and what makes sense for you.

Giraph can convert an input source into vertex-centric or edge-centric output (i.e. HDFS files, HBase, mySQL, etc.) as long as someone writes the code to convert the source data into a graph. In Facebook, we typically load from /store to Hive.

Vertex Input Format

A vertex input format can create vertex ids, vertex values, and edges. It can also create a subset of that data, such as simply vertex ids and vertex values (in that case, you'll likely want an edge input Format to load the edges).

LongDoubleNullTextInputFormat

This vertex input format loads a simple format where a line is a vertex, begining with the vertex id as a long and than destination edge ids. The initial vertex values are 0 and the edges do not have any weights.

Edge Input Format

An edge input format can create edges from one vertex to another and edge values if desired. It can be used as a stand alone input format or in conjuction with the Vertex Input Format. We do not use it in this example

Define how to store the graph from Giraph

After the Giraph computation, you need to store your data back to persistent storage (i.e. HDFS, HBase, or a Hive table).

Vertex Output Format

A vertex output format can dump anything to persistent storage in a vertex-centric way. You can still dump only edges if you choose, or any other type of data. This is pretty flexible. For our example aplication, we simply want to dump the vertex id and the associated final page rank.

VertexWithDoubleValueNullEdgeTextOutputFormat

This vertex output format is perfect for our example. It stores the graph in a single text line with tab as the delimiter between vertex id and vertex value.

Building and compiling page rank

TBD