One of the benefits of the ownCloud architecture is that it scales from small installations, such as a Raspberry Pi for 1-10 users, up to 500,000 users and a petabyte of storage running in a powerful cluster setup. Together with the research organizations CERN and AARNet, we've been collaborating to bring this to a new level and break through the petabyte barrier.
Working with CERN for 9.0
To make ownCloud ready for the next milestone and multi-petabyte scale, we collaborated with the IT team behind the largest particle physics laboratory in the world, the European research organization CERN, as well as Australia's countrywide networking provider for research institutions, AARNet (Australia's Academic and Research Network). CERN has an amazingly powerful infrastructure that collects about a petabyte of data generated per day and distributes it for analysis to more than 8,000 scientists and scientific organizations across the globe. The file storage, called EOS, has 140 petabytes of storage on 35,000 hard drives distributed over 1,000 nodes at two sites capable of delivering 2x100Gb/s. On top of EOS, ownCloud provides a user interface for sharing and collaboration.
We've worked with CERN before, when we adopted the testing framework they developed for the ownCloud syncing process. This time, we set our sights on our data storage infrastructure.
Scaling to the next level
ownCloud has always supported local file systems and NFS storage. Running on top of IBM Spectrum Scale or storage products from Fujitsu and others allowed ownCloud installations to scale to hundreds of thousands of users. A year ago we added support for S3 and Swift-compatible object stores to support even more storage backends. But we wanted to push the boundaries even further. With ownCloud 9.0, we aimed to let storage backends integrate deeply with their underlying storage, for example by using its existing metadata capabilities. This means that we would no longer have to store and manage this metadata in the ownCloud database, removing a potential bottleneck and allowing innovative technology like EOS at CERN to integrate deeply with ownCloud. This functionality was developed based on feedback from CERN (as well as AARNet and other universities and research institutions) given at the ownCloud Contributor Conference last year. These organizations want to give scientists transparent access to, and sharing of, their many petabytes of data through ownCloud.
Results
As a result, for ownCloud 9.0 we developed new Storage and Sharing APIs, which make it possible to write storage connectors that access and use advanced capabilities and metadata directly from the storage. One example is the EOS filesystem, which is developed and used by CERN to store and manage their huge amount of scientific data. This filesystem can provide metadata, such as ETags, FileIDs, and more, which ownCloud 9.0 can use directly, avoiding the need to store it in the ownCloud database and providing the sought-after reduction in overhead. A connector can also leverage the existing sharing capabilities of the storage layer (if available), which removes the need to store sharing information in the central database.
There is an example implementation and developer documentation on GitHub showing how such deeply integrated storage connectors can be written.
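To make the idea of a deeply integrated connector a little more concrete, here is a minimal sketch in Python. It is not the ownCloud 9.0 Storage API and not the example implementation mentioned above: the names FileMetadata, FakeNativeStorage, and NativeMetadataConnector are hypothetical and exist only for illustration. The sketch assumes a storage system that assigns its own file IDs and ETags, and shows a connector that answers metadata queries by asking that storage directly instead of reading a copy from an application database.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FileMetadata:
    file_id: int  # stable identifier assigned by the storage
    etag: str     # changes whenever the file content changes
    size: int     # content length in bytes


class FakeNativeStorage:
    """Hypothetical stand-in for a storage system that tracks its own metadata."""

    def __init__(self) -> None:
        self._meta: Dict[str, FileMetadata] = {}
        self._next_id = 1

    def write(self, path: str, data: bytes) -> None:
        existing = self._meta.get(path)
        if existing is None:
            # New file: the storage hands out a fresh ID.
            file_id = self._next_id
            self._next_id += 1
        else:
            # Overwrite: the ID stays stable, only the ETag changes.
            file_id = existing.file_id
        etag = hashlib.sha256(data).hexdigest()[:16]
        self._meta[path] = FileMetadata(file_id, etag, len(data))

    def stat(self, path: str) -> FileMetadata:
        return self._meta[path]


class NativeMetadataConnector:
    """Hypothetical connector that serves metadata from the storage itself."""

    def __init__(self, storage: FakeNativeStorage) -> None:
        self._storage = storage

    def get_metadata(self, path: str) -> FileMetadata:
        # No lookup in an application database: delegate to the backend.
        return self._storage.stat(path)


storage = FakeNativeStorage()
connector = NativeMetadataConnector(storage)

storage.write("/run1/events.dat", b"first version")
before = connector.get_metadata("/run1/events.dat")

storage.write("/run1/events.dat", b"second version!")
after = connector.get_metadata("/run1/events.dat")
```

In this toy model the file ID is the same in `before` and `after` while the ETag differs, and nothing is copied into a separate database along the way. A real connector would of course talk to the actual storage system rather than an in-memory dictionary.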
During the development, we received regular feedback on the design, architecture, and API from CERN team members on GitHub. Moreover, CERN has promised to give the new API an extensive workout. Currently, the team is updating to ownCloud 8.2, and once that is completed, they will get started with the new API for testing purposes. Further changes are planned, and we continue to benefit from their input and experience in this area. We might see further scalability improvements as a result of conversations with the CERN team, which identified some bottlenecks in the current approach.
Open collaboration
This collaboration between CERN, AARNet, ownCloud, and others is an impressive demonstration of the benefits of a fully open development process and open source software. All discussions, code review, and testing happened in the open on GitHub, showing how an open and transparent process helps us get the best possible results.
A project that grows naturally, as ownCloud does, regularly reaches points at which evolved requirements call for heavy re-architecting and adaptation of the code base to the new realities. In our experience, working in the open while going through such a transformation and getting feedback directly from the end users is a massive benefit, a win-win for all involved, and speeds up the path to maturity.