Apache Beam project

This page contains the details of a technical writing project accepted for Google Season of Docs.

Project summary

Open source organization:
Apache Beam
Technical writer:
Sruthi Sree Kumar
Project name:
Update of the runner comparison page / capability matrix
Project length:
Standard length (3 months)

Project description

Apache Beam is a unified platform for defining both batch and stream processing pipelines. Apache Beam lets you define a model to represent and transform datasets irrespective of any specific data processing platform. Once a pipeline is defined, you can run it on any of the supported run-time frameworks (runners), which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam also comes with different SDKs, which let you write your pipeline in programming languages such as Java, Python, and Go.

I am submitting my application for GSoD on “Update of the runner comparison page/capability matrix”. Because Apache Beam supports multiple runners and SDKs, a new user may find it difficult to choose between them. The current documentation for each runner gives only a very brief overview. My idea is to add more comprehensive details about each runner to that runner's documentation page. I would also like to update the description of the word count example project with a detailed explanation. To do this, I plan to try every word count example locally on my machine, find out whether any steps are missing, and add more explanation of the process. I have also noticed that the documentation for the runners does not follow a consistent pattern (a few pages have an overview section, while others start with how to use the runner, the prerequisites, or some other title). I will update all of them to follow a single, simple pattern.

I plan to add a new page that describes each runner and provides a descriptive narration for each of them (BEAM-3220). From this page, users can navigate to the detailed description page of each runner and to the capability matrix. I also plan to add a descriptive comparison of the runners here. Currently, I am using Beam NEXMark to benchmark Flink runners for my master thesis. As I am well acquainted with NEXMark benchmarking, I would like to include the benchmarking results of each runner in both batch and streaming mode here (BEAM-2944). I would also update the NEXMark documentation if I find that any parameters or configuration options are missing or have been removed. When I first used the Flink runner, I was stuck initially because one of the parameters was missing from the documentation. Now that I am more familiar with the NEXMark code base, it will be easier for me to benchmark the runners and add the metrics. On this same page, I would like to include a brief summary of the production readiness of each runner.

In the current documentation, support for the classic/portable runner is described on each runner description page. I think it would be better to bring this information together in one place, either in the capability matrix or on the newly added description page. Also, the portability support is currently maintained in a separate Google Sheet (https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0), which I would like to merge into the capability matrix. As part of this task, I plan to include all the major and minor corrections mentioned in BEAM-2888.

I consider GSoD an opportunity to step into open source contribution. I will continue to contribute to open source projects, especially Beam, and would like to remain an active community member. As Apache Beam has an active community and features are continuously being developed, I think there is always scope to improve the documentation and keep it up to date. I would also like to contribute to the development work. With sound knowledge of Beam, I can also help the user community, just as the community helped me when I started with Beam.

I believe that I am the right person for this project because:

  1. I am a distributed systems enthusiast who is trying to understand the internals of data processing systems.
  2. I have experience working with Apache Beam and Apache Flink as a user.
  3. I am already familiar with the Apache Beam and Apache Flink code bases as a developer.
  4. I have done a project to compare different Beam runners.
  5. I have experience in writing technical blogs to explain concepts of big data processing and distributed systems.
  6. Currently, I am working on my master thesis to improve the performance of the Apache Flink state backend, for which I am using the Apache Beam NEXMark implementation for benchmarking. I have also contributed to updating the Apache Beam documentation.
  7. In my 4 years of work experience as a software developer, I have written multiple technical design documents, product documentation, and README files (which I do not have access to right now).
  8. I write documentation in such a way that anyone without previous knowledge can understand it at first glance.