CERN-HSF project

This page contains the details of a technical writing project accepted for Google Season of Docs.

Project summary

Open source organization:
CERN-HSF
Technical writer:
Ariadne
Project name:
Rucio – Modernize (restructure & rewrite) the Rucio documentation
Project length:
Standard length (3 months)

Project description

Abstract: The Rucio framework was developed to manage and organize large volumes of geographically distributed scientific data across heterogeneous data centres. Offering capabilities such as distributed data recovery and adaptive replication, the framework is highly scalable, modular, and extensible. Consumers of documentation for such a service come from varied backgrounds and have varied requirements when accessing it. Good documentation should therefore simplify adoption and use for end-users while also serving as a reference for common issues & troubleshooting.

In the absence of such documentation, users face significant hurdles in using the framework efficiently & effectively. This could lead to an increase in support costs and pose a reputational risk to the product. Documentation is, after all, a mode of communication. Ensuring that this communication lives in a manageable & accessible framework, and stays relevant through appropriate versioning, is therefore ensuring that we’re communicating for success.

At the time of writing, the Rucio framework is used to meet the data management needs of the ATLAS and CMS high-energy physics experiments at the LHC. It also supports diverse scientific communities beyond the LHC, such as astrophysics, which makes it necessary for the documentation to be as relevant and as accessible as possible. With this project, CERN wants to give end-users of Rucio a seamless experience by providing a centralized view of all the relevant documentation.

Current State: As of today, the user documentation is spread across different places and exists in multiple formats, including scientific articles, readthedocs.io with source in the code, Google Drive, GitHub, DockerHub, and Wikis. Multiple sources make it hard to track versions and verify the correctness of the documentation. In addition, a decentralized model of documentation poses significant hurdles in navigating to & surfacing relevant information for a given use-case. Especially in the case of Wikis, the information provided for a particular experiment could very well apply to other instances documented in the same or other sources. However, due to the lack of consolidation and appropriate linkages, this information lies dormant and potentially underutilized.

Why is your proposed user documentation an improvement over the current one? Given the multi-faceted problem, the proposed model eliminates the hurdles with navigation, versioning, tracking, and surfacing of documentation, as detailed below:

Restructuring the documentation aims to reduce the effort an end-user spends navigating. Users need not go down rabbit holes while searching for information, since it would be categorized/labelled for simplicity. From an administrative perspective, versioning & tracking would become easier, since restructuring offers the freedom to categorize on the basis of requirement. Centralizing the restructured documentation would ensure that all of the information is visible to the user without having to refer to multiple sources.

Analysis: After reading through the requirements brief & having conversations with the mentoring team, my assessment of the current state of Rucio documentation is as follows:

There are six major sources of documentation:

  • Google Drive Link: https://drive.google.com/drive/folders/1EEN8l1dFjDSgavPrAMMooDjEodHP7aU7

  • Readthedocs, powered by Sphinx, with source in the code. Link to Code: https://github.com/rucio/rucio Link to ReadtheDocs: https://rucio.readthedocs.io/en/latest/

  • DockerHub Link: https://hub.docker.com/u/rucio

  • GitHub Link: https://github.com/rucio/rucio

  • Wikis Link: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/AtlasDistributedComputing

  • Scientific Articles Link: https://arxiv.org/abs/1902.09857

The documentation across these sources is in different formats. For example, Google Drive has documentation in the form of Slides and Docs, while GitHub has files primarily in the reStructuredText markup language. There is a lack of versioning and tracking, leading to redundant information being published across multiple sources. There is no uniformity in the labelling/categorization of information, so previous experience and expertise are required while searching.

Given the myriad formats & sources, the expectation is to restructure the information and centralize it using mkdocs. To better understand the tools involved, I have researched and familiarized myself with their usage.

Verdict: The existing documentation is unstructured and scattered without appropriate linking. It also lacks centralization & uniformity in formatting. As a result, users have to expend extra effort on searches. Such gaps also put unnecessary pressure on administrators/maintainers/leads, which makes it difficult to sustain a community-driven approach to maintaining & updating the documentation. The user & contributor experience is considerably degraded and there would be repeated

Structure for the proposed documentation: After a thorough analysis of the requirements, I have decided to address the major pain points via a restructured model of documentation.
The restructured model is demonstrated in the mock-up linked below and would place every piece of documentation into one of the following 7 categories:

  • About
  • Getting Started
  • Concepts
  • Rucio Interfaces
  • Tasks
  • Tutorials
  • Advanced know-how

Of course, there are further improvements, such as adding links, that I’d like to work on after the completion of this program. With over 1000 active users accessing 500 petabytes of data on Rucio, the proposed restructure of its documentation should significantly reduce the need for users to resort to the support mailing list. The target is a better User Experience through fewer clicks & easier surfacing of documentation via categorization & labeling. Everything there is to know from a user/operations/admin perspective would be available within 3 clicks or fewer.

Mock-up link: https://drive.google.com/file/d/1vSYgOkB9s9eEr2soNs7ujMLHzDlKn_hr/view?usp=sharing
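As a rough illustration, the seven proposed categories could map onto a top-level navigation in mkdocs. The sketch below renders a `nav:` section in the layout that an mkdocs configuration file uses. The file paths are hypothetical placeholders chosen for this example and do not refer to existing Rucio files.

```python
# Hypothetical mapping of the seven proposed categories to landing pages.
# The paths are illustrative assumptions, not existing Rucio files.
CATEGORIES = [
    ("About", "about/index.md"),
    ("Getting Started", "getting-started/index.md"),
    ("Concepts", "concepts/index.md"),
    ("Rucio Interfaces", "interfaces/index.md"),
    ("Tasks", "tasks/index.md"),
    ("Tutorials", "tutorials/index.md"),
    ("Advanced know-how", "advanced/index.md"),
]


def render_nav(categories):
    """Render a top-level `nav:` section as YAML text, one entry per category."""
    lines = ["nav:"]
    for title, path in categories:
        lines.append(f"  - {title}: {path}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_nav(CATEGORIES))
```

Keeping the category list in one place like this would make it easy to review the structure with the mentors before any pages are moved.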

Project Goals:

  • Analyze and prune redundant information available from various sources, i.e. every piece of information should have one source of truth.
  • Restructure by labeling & categorizing the existing documentation into different parts.
  • Migrate the restructured documentation to a centralized view based on mkdocs.
  • Reformat/import documentation that cannot be migrated due to file format constraints.
  • Set up community-driven modification of documentation to ensure any gaps are filled, in terms of linkages, updates to information, or correction of errors.

The barebones of this system are already in place. However, my model would improve upon the existing system by laying down proper guidelines for contribution & governance with appropriate documentation. Furthermore, I envision incorporating GitHub project boards for tracking issues and the overall health of the project.
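The first goal, pruning redundant information, could start with an automated pass that flags identical content published in more than one place. The sketch below is one possible approach, not a settled method: it assumes the documents have already been exported as plain text, and the document names in the sample are made up. It only catches copies that are identical after whitespace and case normalization, so near-duplicates would still need manual review.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicates(documents):
    """Group document names whose normalized content is identical.

    `documents` maps a name (for example a path or URL) to its plain text.
    Returns a list of sorted name groups that contain more than one member.
    """
    groups = defaultdict(list)
    for name, text in documents.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    # Hypothetical sample: the names and texts are invented for illustration.
    sample = {
        "wiki/install": "Install the client  with pip.",
        "drive/setup-notes": "install the client with pip.\n",
        "github/README": "Rucio manages distributed scientific data.",
    }
    print(find_duplicates(sample))
```

Each reported group would then be reviewed by hand to decide which copy becomes the single source of truth.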

Timeline:

  • Before August 16th
    --> Familiarize myself with current versions of documentation & Rucio
    --> Learn new techniques and technical writing skills that will be helpful during the term of the project
    --> Contribute towards documentation issues, if any, reported on GitHub

  • Community bonding (August 17 - September 13)
    --> Set up a communication channel and a meeting time that accounts for the difference in time zones (Pune is 3 hours 30 minutes ahead)
    --> Identify major pain points towards refinement of goals
    --> Learn more about the community, organization, and the framework by engaging in conversations
    --> Assess the proposed documentation structure with mentors and other key members of the organization for viability & feasibility of implementation
    --> Finalize the proposed features and any other modifications that may need to be made to the existing documentation

  • Documentation Period (14th September - 30th November)
    Based on the format proposed above, I have provided a breakdown of the major milestones I plan to achieve during the documentation period.

--> Milestone #1: Categorizing & Labelling
    ETC: 28th September, 2020
    Collecting the available documentation and labelling it would greatly simplify the restructuring & pruning process.

--> Milestone #2: Analysis, Pruning & Restructuring
    ETC: 19th October, 2020
    Documentation that has been categorized during Milestone #1 would be analyzed for duplicates and redundant sources of information. As stated in the project information, we are targeting one source of truth for all information that is available.

--> Milestone #3: Centralizing & Reformatting
    ETC: 9th November, 2020
    Once the documentation has been pruned & restructured properly, I would reformat it first. Because the sources differ, so do the formats, and the documents first need to be converted into an appropriate common format. Once this is done, the centralizing process becomes easier.

--> Milestone #4: Setting up tracking boards + documentation around governance/contributions
    ETC: 23rd November, 2020
    This phase is to ensure that the documentation continues to be updated after the completion of the project. Laying down guidelines and setting up project boards will make it easier for the administrative members to solicit community contributions and track them effectively.

  • Project Evaluation (30th November - 5th December)
    --> Submit a project report and an evaluation of my mentors
    --> Write and submit a report of my experience as a participant in Season of Docs

Why this project? I have long believed that supplementing code with well-written and versioned documentation is the only way to enable further adoption & better usage. Personally, I have been fascinated by the way CERN has pioneered cutting-edge research in different areas of Physics. Given the scale of information processed, transferred, and generated during such experiments, I have always been intrigued by how data is managed for reference & future use within the organization. It would be an honor to contribute to improving the documentation of a framework that has been powering some amazing scientific research and discoveries.

Why am I the right person for this project? In addition to meeting the prerequisites, I am confident I would be the right person for this project since:

  • I’m already working on modifying existing documentation for Kubernetes. These contributions have resulted in my being selected as a Release Docs Shadow for the 1.19 Kubernetes Release cycle, in which I help maintain and upgrade documentation for new features that get added during releases.
  • I believe that good documentation is the backbone of a great product/service. Whether it be procedural or technical, information that is well-written, concise, and easily accessible drives adoption & aids better usage.
  • Having worked with data-driven distributed systems throughout my career, I believe I’m well positioned to understand the intricacies of documentation requirements for such systems. Having been an end user myself, I’m familiar with the pitfalls of poorly written/incorrect documentation and would be careful to take them into account during the restructure.