Measuring the health of an open source community is a topic of increasing importance. From the moment an open source community forms, researchers, maintainers, and organizations try to understand whether the community is healthy and what makes it healthy.
"If you don't measure it, you cannot improve it"
—Peter Drucker
The Community Health Analytics for Open Source Software (CHAOSS) project offers a formal approach to understanding community health. The project started in 2017, bringing four stakeholders (open source communities, academia, organizations, and toolmakers) together under the Linux Foundation's umbrella. GrimoireLab, the focus of this article, is one of CHAOSS's co-founding projects.
I have been involved in open source for more than 14 years and helped start the CHAOSS project during my PhD while researching organizational engagement in open source. After I earned my PhD, I joined Bitergia, the toolmaker that co-founded CHAOSS by contributing its open source toolset, GrimoireLab. In my new job, I gained insights that I think are relevant to anyone who wants to develop tools for analyzing open source projects. GrimoireLab's story is interesting because it has all the elements of a hero's journey, which I'll share in three acts, along with some of the lessons learned along the way.
Act 1: Departure
The first act of a hero's journey describes the not-yet hero and the environment and introduces a call to adventure.
GrimoireLab started in academia 16 years ago, when the LibreSoft team at the Universidad Rey Juan Carlos in Spain built tools for analyzing software development projects. For scientific rigor and to allow replicability of their work, all of the software was released under an open source license.
As the team worked with organizations to analyze their software development projects and validated the tools with experts in the field, the first iteration of GrimoireLab's tools received immense interest from organizations. The business value of analyzing software projects was apparent, and Bitergia was founded to provide software development analytics, advancing the value of the open source GrimoireLab toolset.
In short, GrimoireLab was born from the desire to analyze and understand open source projects and communities. The call to adventure came from organizations that promised treasures. GrimoireLab crossed the threshold on its hero's journey when Bitergia took it from the comfortable academic setting into the uncharted business world.
Act 2: Initiation
The second act of a hero's journey introduces trials, allies, and enemies, and the hero grows in character by overcoming one challenge after another. The hero approaches the innermost cave, meets the ultimate challenge, and may even die to be reborn. This was the fate of GrimoireLab.
The lessons GrimoireLab learned from 16 years of open source project analysis built the foundation of its later design. Not all lessons learned are easy to share, and some still carry the stench of defeat.
Choosing a database technology
The predecessor of today's GrimoireLab was built with a relational database to store and retrieve open source community data. Why was a relational database chosen? Maybe universities focused their teaching on relational databases. Maybe it was the zeitgeist to build applications on relational databases. I don't know if it was a deliberate decision, or it just happened that way, but there were problems with it.
A major one was the relational schema, which only allows predetermined data and enforces rules upon it. Different data sources, like mailing list archives, the Git log history, or issue tracker APIs, have different data points that are interesting to collect and analyze.
A second problem is that only some data fields can be mapped and shared across data sources (e.g., dates, author). Others are unique to one source (e.g., commit hash) and require adjustments to the relational schema, which in turn require changes to the data collection and data analysis components that interact with the database. Furthermore, open source collaboration platforms evolve constantly and change their data APIs, which forces further changes to the database schema.
Therefore, GrimoireLab had to redesign its tools to abandon relational databases and their shortcomings, which hindered open source community analysis. On the hero's journey, the original GrimoireLab tools built with a relational database failed the trials and had to die. From the underworld, a new GrimoireLab platform design was reborn.
In designing the current GrimoireLab platform, the engineers selected Elasticsearch as the database. GrimoireLab now loads JSON files into Elasticsearch and queries the data from this flexible data store. JSON files can be unique to the different data sources where community data is collected, which allows the flexibility to add additional data sources and make different data available.
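To make the idea concrete, here is a minimal sketch of how items from different sources might be wrapped as JSON documents that share a few common fields while keeping their source-specific payload untouched. The field names, the sample items, and the to_document helper are illustrative assumptions, not GrimoireLab's actual document format.

```python
import json

# Hypothetical raw items from two data sources; field names are illustrative.
git_item = {
    "commit": "a1b2c3",
    "Author": "Jane Doe <jane@example.org>",
    "AuthorDate": "2020-05-01T10:00:00+00:00",
    "files": ["README.md"],
}
issue_item = {
    "number": 42,
    "user": "janedoe",
    "created_at": "2020-05-02T09:30:00+00:00",
    "labels": ["bug"],
}


def to_document(source, item, author_key, date_key):
    """Wrap a source-specific item in a small envelope of shared fields."""
    return {
        "origin": source,            # which data source produced the item
        "author": item[author_key],  # shared field, mapped per source
        "date": item[date_key],      # shared field, mapped per source
        "data": item,                # untouched source-specific payload
    }


documents = [
    to_document("git", git_item, "Author", "AuthorDate"),
    to_document("issues", issue_item, "user", "created_at"),
]

# Each document serializes to JSON regardless of its source-specific fields.
for doc in documents:
    print(json.dumps(doc, sort_keys=True))
```

Adding a new data source in this sketch means writing one more mapping call; no shared schema has to change when a source gains or loses a field.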
A compelling reason to choose Elasticsearch was its data visualization platform, Kibana, which allows the GrimoireLab engineers to focus on the data and reuse code for visualizing it.
Lesson learned: Use a flexible database schema that can deal with changes to the data.
Managing community member identities and affiliations
One of the challenges GrimoireLab faced is the fact that open source communities use a variety of different collaboration platforms. GitHub and GitLab are popular platforms that host source-code repositories, issue trackers, and related tools that simplify collaborative software development. Email lists, instant messaging, and forums are platforms used for communication and collaboration.
When we want to analyze open source community health, we are interested in community interactions in all these platforms. To gain complete insights into a community, we want to integrate the data from these different data sources in order to analyze it.
One challenge with integrating data from several data sources is establishing contributors' identities. Community members may use different aliases or email addresses when they contribute to various communities. For example, I have used my personal, university, and work email addresses to write to mailing lists, and I have contributed as "Georg Link," "Georg J.P. Link," "G. Link," and "GeorgLink." This may not be by choice, as community members may have to select a different username when their preferred one is not available. However, for a full overview of a community member's contributions in an open source community, it is necessary to combine all of these different identities into one.
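The merging problem can be sketched in a few lines. The example below groups identities that share an email address or username and then applies merges that a person has confirmed by hand. The sample data, the tuple layout, and the merge_identities function are assumptions made for illustration; this is not how any particular tool implements identity matching.

```python
from collections import defaultdict

# Hypothetical identities as (name, email, username) found in different sources.
identities = [
    ("Georg Link", "georg@personal.example", None),
    ("Georg J.P. Link", "georg@university.example", None),
    ("G. Link", "georg@university.example", "GeorgLink"),
    ("GeorgLink", None, "GeorgLink"),
    ("Another Person", "another@work.example", "aperson"),
]


def merge_identities(identities, manual_merges=()):
    """Group identities sharing an email or username, plus manual merges."""
    parent = list(range(len(identities)))

    def find(i):
        # Follow parent links to the group's representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}
    for index, (_name, email, username) in enumerate(identities):
        for kind, value in (("email", email), ("username", username)):
            if value is None:
                continue
            key = (kind, value.lower())
            if key in first_seen:
                union(index, first_seen[key])
            else:
                first_seen[key] = index

    # Merges that cannot be detected automatically are supplied by a person.
    for i, j in manual_merges:
        union(i, j)

    groups = defaultdict(list)
    for index, identity in enumerate(identities):
        groups[find(index)].append(identity)
    return list(groups.values())


automatic = merge_identities(identities)
with_manual = merge_identities(identities, manual_merges=[(0, 1)])
print(len(automatic), len(with_manual))
```

In this sketch the identity that only appears with a personal email address stays separate until someone merges it manually, which mirrors why automatic matching alone is rarely enough.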
A related challenge is establishing community members' affiliations. In today's ecosystem, many contributions to open source communities are made on behalf of an organization or employer. Companies are increasingly relying on and contributing to open source software. A critical component in analyzing open source community health is understanding which organizations are involved by proxy through their employees' involvement. Sometimes a community member's affiliation is obvious, such as when they use their work email with the organization's domain. Other times, this information is not available from the data sources directly, so it must be input manually.
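A rough sketch of this two-step lookup follows: use a manually maintained entry when one exists, otherwise fall back to the email domain. The domains, organization names, and the affiliation function are hypothetical and only illustrate the idea.

```python
# Hypothetical lookup tables; domains and organizations are illustrative.
DOMAIN_TO_ORG = {
    "corp.example": "Example Corp",
    "university.example": "Example University",
}
# Entries a person added by hand because the domain says nothing useful.
MANUAL_AFFILIATIONS = {
    "dev@personal.example": "Example Corp",
}


def affiliation(email, domain_to_org=DOMAIN_TO_ORG, manual=MANUAL_AFFILIATIONS):
    """Return the organization for an email address, or "Unknown"."""
    email = email.strip().lower()
    if email in manual:
        # Manual input takes precedence over the domain heuristic.
        return manual[email]
    domain = email.rpartition("@")[2]
    return domain_to_org.get(domain, "Unknown")


print(affiliation("Dev@Corp.example"))
print(affiliation("dev@personal.example"))
print(affiliation("someone@mail.example"))
```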
GrimoireLab overcame these challenges after it found an ally in SortingHat, which can identify people and their affiliations. As the identities provider for GrimoireLab, SortingHat enriches the collected data with information about the people who made contributions.
Lesson learned: Managing people's identities and affiliations is a critical element to doing quality analysis in open source.
Visualizing community health data
On a hero's journey, the hero may be lured into a "shortcut" that leads to dangers and setbacks. For GrimoireLab, that shortcut was hard-coding visualizations. Its predecessors presented data through hard-coded, "quick and dirty" visualizations, so any change to the data structure or presentation required changing source code.
This shortcut was chosen because the users were also the developers, so they could adjust the visualizations as needed. But it created setbacks when Bitergia started providing community health analytics to customers. The new users wanted to explore their data but could not modify the source code.
Our hero, GrimoireLab, found its way out of the predicament when it met an ally that could produce visualizations using a generic data exploration and visualization tool.
The GrimoireLab platform now uses Kibana as the default tool for visualizations. Users can query the data freely and create custom visualizations. This requires knowledge about the underlying data structure and meaning. From a tooling perspective, changes in the data structure require no changes to the generic data exploration and visualization tool.
A caveat to providing a generic data exploration and visualization tool is that users value quick wins. The data should be visualized immediately to show users at least an overview of a community. Therefore, GrimoireLab Sigils allows the sharing of visualizations and dashboards.
Lesson learned: Do not hard-code visualizations if the format and type of data may change.
Providing data to other tools
Some trials are straightforward for a hero who has a reliable ally to help solve them. The hero does not need to understand how a trial is overcome nor feel like they had any part in overcoming it. However, without the hero, the ally cannot overcome the trial and may not even know about it.
GrimoireLab users have different preferences for what data they want to see and how they want to see it. Some may be happy to query a database, some may be happy to take a quick glance at a dashboard, some may be happy to explore data visually in a generic tool, and some may only have access to the data through reports or announcements from others.
The latter group could use a snapshot of the data. However, like company stock prices, community health metrics change. Therefore, users may want to include "live" metrics and visualizations on their websites or other places, so GrimoireLab provides two options for different user types.
The first option in GrimoireLab is to use Kibana to embed visualizations and dashboards using HTML's inline frames. The user designs the visualization in Kibana and embeds the visualization, which is loaded from Kibana with the latest information. This requires Kibana to be publicly accessible. This option is good for users who need a quick way to share community health metrics but don't want to spend effort customizing the visualization.
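As a small illustration, the snippet below builds the inline frame markup for a page. The URL is a made-up placeholder; in practice you would paste the embed link that your own publicly accessible Kibana instance provides when you share a dashboard. The iframe_snippet helper is hypothetical.

```python
from html import escape


def iframe_snippet(embed_url, width=800, height=600):
    """Return an HTML inline frame that loads a shared visualization."""
    return '<iframe src="{}" width="{}" height="{}"></iframe>'.format(
        escape(embed_url, quote=True),  # keep the URL safe inside the attribute
        int(width),
        int(height),
    )


# Placeholder URL; replace it with the embed link from your own Kibana.
snippet = iframe_snippet(
    "https://kibana.example.org/app/dashboards#/view/community-overview?embed=true"
)
print(snippet)
```

Because the frame loads from Kibana each time the page is viewed, the embedded chart reflects the latest data without regenerating the page.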
Lesson learned: Provide an easy way to share "live" metrics.
The second option is for users to build custom integrations by querying the Elasticsearch API directly. Users can query the database for the desired data and then build a customized visualization. This option is good for users who want to integrate their data into other platforms and fully control the visualization and user experience. Integrators can use Kibana to explore the data, form specific queries, then copy the query from Kibana into their own integration work.
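A custom integration might start from something like the following sketch, which prepares a search request that counts items per month. The base URL, the index name git_enriched, and the field name commit_date are assumptions; use the names you find when exploring your own data. The calendar_interval parameter assumes a reasonably recent Elasticsearch version.

```python
import json
import urllib.request


def items_per_month_request(base_url, index, date_field, since):
    """Build (but do not send) a search request counting items per month."""
    query = {
        "size": 0,  # only the aggregation is needed, not the documents
        "query": {"range": {date_field: {"gte": since}}},
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": date_field,
                    "calendar_interval": "month",
                }
            }
        },
    }
    return urllib.request.Request(
        url="{}/{}/_search".format(base_url.rstrip("/"), index),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index, and field names.
request = items_per_month_request(
    "http://localhost:9200", "git_enriched", "commit_date", "2020-01-01"
)
print(request.full_url)
# Passing the request to urllib.request.urlopen would send it to the server.
```

The JSON response can then feed whatever charting or reporting layer the integrator prefers.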
Lesson learned: Provide an easy way to explore data and build custom queries for data that may be needed for other tools.
Calculating metrics from raw data
As a hero grows with every trial, the journey continues with ever-more challenging trials, keeping the story enticing. GrimoireLab overcame the trials of collecting data, storing it effectively, and visualizing it. The next trial was to produce meaningful metrics from raw data.
To demonstrate, let's look at the CHAOSS metrics. Some metrics can be built from data that is directly available from data sources. For example, Code Changes is a count of commits in the Git log. Other metrics are not directly available but require manipulation of the data first. For example, Issue Response Time requires calculating the time difference between when an issue is opened and when the first response is submitted.
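The difference between the two kinds of metrics shows up clearly in code. In this simplified sketch, Code Changes is a plain count, while Issue Response Time needs a calculation over timestamps. The sample data and function names are invented for illustration, and the full CHAOSS definitions include more detail than is modeled here.

```python
from datetime import datetime, timedelta

# Hypothetical sample data standing in for a Git log and one issue.
commits = [{"hash": "a1"}, {"hash": "b2"}, {"hash": "c3"}]
opened_at = datetime(2020, 5, 1, 9, 0)
responses = [datetime(2020, 5, 2, 8, 0), datetime(2020, 5, 1, 15, 30)]


def code_changes(commits):
    """Directly available metric: count the commits."""
    return len(commits)


def issue_response_time(opened_at, response_times):
    """Derived metric: time from opening to the earliest response, if any."""
    later = [moment for moment in response_times if moment >= opened_at]
    if not later:
        return None  # nobody has responded yet
    return min(later) - opened_at


print(code_changes(commits))
print(issue_response_time(opened_at, responses))
```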
I expect that as CHAOSS continues defining metrics, they will become more complex. Complex metrics may be based on other metrics. A software development analytics tool like GrimoireLab must be able to calculate these complex metrics.
GrimoireLab's GrimoireELK stores incoming raw data before doing anything with it. Then the data is enriched, and the result of that process is stored separately as enriched data. Then studies are performed in which enriched and raw data are combined in different ways from different data sources and calculated for specific purposes. For example, one of GrimoireLab's default studies identifies casual, regular, and core contributors and enables analyzing the contributions of these groups.
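The three stages can be sketched as follows: raw items are kept as collected, an enrichment step derives convenient fields, and a study computes something across the enriched items. The sample data, the enrich and classify_contributors functions, and the 80% and 95% thresholds are assumptions chosen for illustration; GrimoireLab's own studies may use different fields and cutoffs.

```python
from collections import Counter

# Stage 1: raw items stored exactly as collected (hypothetical Git data).
raw_items = (
    [{"origin": "git", "data": {"Author": "Ana <ana@example.org>"}}] * 85
    + [{"origin": "git", "data": {"Author": "Ben <ben@example.org>"}}] * 12
    + [{"origin": "git", "data": {"Author": "Cy <cy@example.org>"}}] * 3
)


def enrich(raw_item):
    """Stage 2: derive easy-to-query fields; the raw item is left untouched."""
    name = raw_item["data"]["Author"].split("<")[0].strip()
    return {"origin": raw_item["origin"], "author_name": name}


def classify_contributors(enriched_items, core_share=0.80, regular_share=0.95):
    """Stage 3 (study): label authors by cumulative share of contributions."""
    counts = Counter(item["author_name"] for item in enriched_items)
    total = sum(counts.values())
    roles = {}
    covered = 0  # contributions already attributed to more active authors
    for author, count in counts.most_common():
        if covered < core_share * total:
            roles[author] = "core"
        elif covered < regular_share * total:
            roles[author] = "regular"
        else:
            roles[author] = "casual"
        covered += count
    return roles


enriched_items = [enrich(item) for item in raw_items]
roles = classify_contributors(enriched_items)
print(roles)
```

Keeping the raw items separate means the enrichment and study steps can be rerun with different rules without collecting the data again.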
Lesson learned: Enriching data may require several steps, with simple metrics serving as building blocks for more complex ones.
Act 3: Return
The third act of the hero's journey begins after the trials have been overcome, a reward is received, and the hero returns home. The return may begin with a kind of resurrection or rebirth for breathtaking action. Finally, the hero gains the freedom to live as the master of two worlds. GrimoireLab was reborn. From the ashes of its ancestors, the lessons learned were implemented in the completely resurrected (i.e., re-engineered) GrimoireLab platform that we have today.
GrimoireLab received its reward in the form of recognition. As a founding project of the Linux Foundation's CHAOSS project, GrimoireLab ascended to be part of a larger ecosystem. Although it started at a small company based in Spain, GrimoireLab is now a part of the fast-growing commercial open source ecosystem created by the Linux Foundation with five working groups and dozens of contributors.
Flowing from the GNU General Public License, GrimoireLab has the freedom to choose its own destiny. And most importantly, so do its users.