In this tutorial, we will create our first Samza application - WordCount. This application will consume messages from a Kafka stream, tokenize them into individual words and count the frequency of each word. Let us download the entire project from here.
The interface provides a single method named describe(), which allows us to define our inputs, the processing logic and outputs for our application.
Describe your inputs and outputs
To interact with Kafka, we will first create a KafkaSystemDescriptor by providing the coordinates of the Kafka cluster. For each Kafka topic our application reads from, we create a KafkaInputDescriptor with the name of the topic and a serializer. Likewise, for each output topic, we instantiate a corresponding KafkaOutputDescriptor.
The above example creates a MessageStream which reads from an input topic named sample-text. It also defines an output stream that emits results to a topic named word-count-output. Next let’s add our processing logic.
Add word count processing logic
Kafka messages typically have a key and a value. Since we only care about the value here, we will apply the map operator on the input stream to extract the value.
Next, we will tokenize the message into individual words using the flatmap operator.
We now need to group the words, aggregate their respective counts and periodically emit our results. For this, we will use Samza’s session-windowing feature.
Let’s walk through each of the parameters to the above window function:
The first parameter is a “key function”, which defines the key to group messages by. In our case, we can simply use the word as the key. The second parameter is the windowing interval, which is set to 5 seconds. The third parameter is a function which provides the initial value for our aggregations. We can start with an initial count of zero for each word. The fourth parameter is an aggregation function for computing counts. The next two parameters specify the key and value serializers for our window.
The output from the window operator is captured in a WindowPane type, which contains the word as the key and its count as the value. We add a further map to format this into a KV, that we can send to our Kafka topic. To write our results to the output topic, we use the sendTo operator in Samza.
The full processing logic looks like the following:
Configure your application
In this section, we will configure our word count example to run locally in a single JVM. Let us add a file named “word-count.properties” under the config folder.
For more details on Samza’s configs, feel free to check out the latest configuration reference.
Run your application
We are ready to add a main() function to the WordCount class. It parses the command-line arguments and instantiates a LocalApplicationRunner to execute the application locally.
Before running main(), we will create our input Kafka topic and populate it with sample data. You can download the scripts to interact with Kafka along with the sample data from here.
Let’s kick off our application and use gradle to run it. Alternately, you can also run it directly from your IDE, with the same program arguments.
The application will output to a Kafka topic named “word-count-output”. We will now fire up a Kafka consumer to read from this topic:
It will show the counts for each word like the following:
Congratulations! You’ve successfully run your first Samza application.