The interplay among technologies in today’s software landscape can produce remarkable results. The convergence of WebAssembly, Rust, and serverless/FaaS architectures lets us envision data processing platforms that are frugal, safe, and highly efficient.
In a previous exploration, we demonstrated the potential of using Rust and WebAssembly for efficient data transformation within Punch data processing pipelines. We prototyped a pipeline engine to run WebAssembly data processing functions on streaming data. Since then, the integration of Rust and WebAssembly has seen substantial growth. Rust’s expressive syntax, memory safety, and WebAssembly’s binary format have solidified their place as a dynamic duo for accelerated and secure code execution. The open-source community’s dedication to refining this integration has paved the way for smoother development experiences and improved runtime performance.
Our prototype only focused on WebAssembly. We decided to develop a robust industrial version with additional objectives:
- Frugality: There is no need to explain the dramatic contribution of data centers and data processing to ever-increasing worldwide carbon emissions. We look for the tiniest data processing engine to reduce energy costs wherever possible. This usually starts with attempts to reduce the required CPU and memory footprints; in this space, Rust is a wise choice. This is, however, not good enough; a holistic approach that processes only the required data at the right place is the proper solution. As we will see, such a smart architecture can significantly benefit from the Rust/WebAssembly pairing.
- Safety: We work in a company that provides essential mission-critical services. Auditability, security, and sandboxing are, of course, paramount.
- Performance: Frugality and safety must come with improved performance.
We implemented a robust industrial asset called REEF to achieve these goals.
REEF stands for REsource Efficient Functions. It is (yet another) engine that allows you to run your business functions, themselves written in Rust or any programming language compatible with WebAssembly.
In the rest of this blog, we will explain how it works, why we do it, and how we plan to achieve frugal data processing.
The case for serverless architecture
Serverless means you code only the business functions; the server that runs them is provided by someone else. REEF is an application designed to host your functions. At Punch, we have deployed functions for years now in various runtime engines, small or big, leveraging Java, Python, (Py)Spark, or Flink capabilities and features. Exploring the same pattern using Rust and allowing developers to ship their functions in Rust or WebAssembly was thus a logical step.
Here is how it works concretely: to deploy your business functions on data streams, you insert them into graphs that you define using a simple configuration file (or a fancy editor). A graph chains data sources, (your) functions, and data sinks. Sources read data from somewhere useful, and sinks write data to somewhere useful. You focus only on your business functions.
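To make the source, function, and sink chain more tangible, here is a minimal sketch in plain Rust. It is not the REEF API: the `Row` alias, the `Source` and `Sink` traits, the `Function` type, and `run_graph` are all hypothetical names chosen for illustration, and a real engine would add asynchronous I/O, configuration, and richer error handling.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for a data row: ordered (column, value) pairs.
// This is NOT the REEF row type.
type Row = Vec<(String, String)>;

/// A source yields rows until it is exhausted.
trait Source {
    fn next_row(&mut self) -> Option<Row>;
}

/// A sink consumes the rows that made it through the graph.
trait Sink {
    fn write(&mut self, row: Row);
}

/// A business function transforms a row or reports an error.
type Function = Box<dyn Fn(Row) -> Result<Row, String>>;

/// In-memory source, standing in for a Kafka, MQTT, or HTTP source.
struct VecSource {
    rows: VecDeque<Row>,
}

impl Source for VecSource {
    fn next_row(&mut self) -> Option<Row> {
        self.rows.pop_front()
    }
}

/// In-memory sink that simply stores what it receives.
#[derive(Default)]
struct VecSink {
    rows: Vec<Row>,
}

impl Sink for VecSink {
    fn write(&mut self, row: Row) {
        self.rows.push(row);
    }
}

/// Runs a linear graph: source -> functions (in order) -> sink.
/// Rows for which a function fails are dropped; the number of
/// failed rows is returned.
fn run_graph(source: &mut dyn Source, functions: &[Function], sink: &mut dyn Sink) -> usize {
    let mut failures = 0;
    while let Some(row) = source.next_row() {
        // Thread the row through every function, stopping at the first error.
        let result = functions.iter().try_fold(row, |current, f| f(current));
        match result {
            Ok(out) => sink.write(out),
            Err(_) => failures += 1,
        }
    }
    failures
}
```

The point of the design is that the function author only writes the `Fn(Row) -> Result<Row, String>` part; everything about reading and writing data stays in the engine.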
Note that REEF, as such, is not a serverless solution; it is only one of its main building blocks. An example of a Kubernetes-native serverless platform is Punch, which provides many management and configuration features to automate the deployment of functions inside REEF. This is, however, not the topic of this blog. Remember that REEF has no dependency on Punch, Kubernetes, or any other stack; you can use it as is to build your solution.
Returning to REEF as a function processing engine, many similar engines are on the market or in the open-source communities. Some specialize in log management, others in artificial intelligence or big data. Presented this way, REEF is simply a new implementation in Rust with WebAssembly support. What are the alternatives? Here are some in the Rust ecosystem:
- Vector: a lightweight observability runtime written in Rust. It provides many collectors. REEF is similar to Vector but focuses on two features not offered by Vector: an explicit node and function API, and support for WebAssembly.
- RisingWave: a promising streaming SQL engine written in Rust. See it as an Apache Flink alternative. RisingWave is SQL-centric, which is excellent, and only allows you to plug in user-defined functions. It provides exactly-once semantics, scalability, and high availability. We plan to use it to implement cases requiring stateful complex processing, which is not REEF’s mission.
- Spin: “Spin is an open source framework for building and running fast, secure, and composable cloud microservices with WebAssembly.” Spin shares REEF’s objective of making it easy to deploy WebAssembly user functions.
Why REEF, then? REEF has a precise focus that differs from these related technologies:
- Run WebAssembly and Rust functions in a stateless/at-least-once engine that fits small devices. This sounds simple but poses subtle issues in data exchange between Rust and WebAssembly, as explained below.
- Allow for lightweight ML use cases. We use REEF to run TensorFlow Lite models.
- Offer device management features such as remote function updates. Our goal is to equip devices with REEF binaries that are, in turn, quickly and safely updated with new versions of WebAssembly functions. WebAssembly technology brings critical advantages.
This combination of management, frugality, and flexibility positions REEF as an attractive choice for modern, containerized ecosystems, from large-scale cloud deployment to edge IoT devices.
REEF provides 20+ out-of-the-box and well-documented connectors, including HTTP, Kafka, and MQTT, each with security features to ensure data integrity. More importantly, REEF exposes a clean API and quick-start examples that allow developers to create custom native connectors, opening up possibilities for tailored integrations. This dual approach balances convenience and extensibility, addressing fast deployment needs and specialized use cases.
Custom or not, the provided connectors free the business developers from the intricacies of I/O operations, asynchronous style programming, and subtle memory allocation issues requiring fair expertise in system programming and Rust.
The case for WebAssembly
Central to the capabilities of REEF is its support for WebAssembly. Developers can submit their functions as WebAssembly modules. This makes it possible to write functions in any of the 60+ programming languages compatible with WebAssembly. This portability comes at (almost) no performance penalty.
Functions are encapsulated within a secure and isolated sandbox. As a result, updating and managing these functions becomes simple and safe, even for a fleet of remote IoT devices: no more costly and dangerous firmware updates.
The integration of WebAssembly relies on two key points.
- Typed Data Serialization Mechanism: REEF leverages a specific mechanism for serializing typed data. This facilitates seamless communication between the Rust engine and the WebAssembly module.
- High-Level API: REEF exposes a high-level API that, under the hood, orchestrates the glue between the Rust engine and the WebAssembly function, as just explained.
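Why is a serialization mechanism needed at all? Core WebAssembly function signatures carry only numeric values, so richer data such as strings is usually passed as bytes through the module's linear memory. The sketch below illustrates the general idea with a simple tag-and-length encoding. It is an assumption made for illustration, not REEF's actual wire format: the `Value` enum, `encode`, and `decode` are hypothetical names, and only two value types are covered.

```rust
/// Hypothetical typed value. This is NOT the REEF `Value` type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

/// Encodes values into a flat byte buffer that could be copied into
/// a module's linear memory. Layout per value (an assumption):
///   Int:  tag 0, then 8 bytes little-endian
///   Text: tag 1, then u32 little-endian length, then UTF-8 bytes
fn encode(values: &[Value]) -> Vec<u8> {
    let mut buf = Vec::new();
    for value in values {
        match value {
            Value::Int(i) => {
                buf.push(0);
                buf.extend_from_slice(&i.to_le_bytes());
            }
            Value::Text(s) => {
                buf.push(1);
                buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
                buf.extend_from_slice(s.as_bytes());
            }
        }
    }
    buf
}

/// Decodes a buffer produced by `encode`, rejecting truncated input,
/// unknown tags, and invalid UTF-8.
fn decode(mut bytes: &[u8]) -> Result<Vec<Value>, String> {
    let mut out = Vec::new();
    while let Some((&tag, rest)) = bytes.split_first() {
        match tag {
            0 => {
                if rest.len() < 8 {
                    return Err("truncated integer".to_string());
                }
                let (head, tail) = rest.split_at(8);
                let mut raw = [0u8; 8];
                raw.copy_from_slice(head);
                out.push(Value::Int(i64::from_le_bytes(raw)));
                bytes = tail;
            }
            1 => {
                if rest.len() < 4 {
                    return Err("truncated text length".to_string());
                }
                let (head, tail) = rest.split_at(4);
                let mut raw = [0u8; 4];
                raw.copy_from_slice(head);
                let len = u32::from_le_bytes(raw) as usize;
                if tail.len() < len {
                    return Err("truncated text".to_string());
                }
                let (text, tail) = tail.split_at(len);
                let s = std::str::from_utf8(text).map_err(|e| e.to_string())?;
                out.push(Value::Text(s.to_string()));
                bytes = tail;
            }
            other => return Err(format!("unknown tag {}", other)),
        }
    }
    Ok(out)
}
```

A high-level API can hide exactly this kind of plumbing: the function author sees typed values on both sides, while the engine and the generated glue handle the byte-level exchange.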
The following code snippet demonstrates a simple data transformation use case using REEF WebAssembly capabilities. Say you want to convert a date from one format to another. Write the following function in Rust:
Let us go through the important steps:
- The std::time module is imported for handling time-related operations. You can import any WASI-compatible standard library module.
- The chrono crate is imported to work with dates and times.
- reef_api::value::Value is imported from the REEF API for data manipulation.
- The reef_macro::reef_bindgen attribute is imported; it is used for binding the function to the REEF engine.
Defining the WebAssembly-powered function:
- The function moment is defined with a parameter representing a data row.
- The function returns a Result<Row, String>, indicating success or failure. REEF is well-equipped to deal with errors.
- The function is decorated by the reef_bindgen macro, which requires this signature.
Extracting the timestamp and determining the moment of the day:
- The timestamp is extracted from the input row using the get method and converted to an integer.
- system_time is calculated by adding the timestamp to the Unix epoch.
- A DateTime object is created from system_time, specifying that it is in the UTC timezone.
- A variable moment is determined based on the hour and minute of the DateTime object. It arbitrarily categorizes the time as “midnight,” “midday,” “night,” or “day.”
Creating and returning the output:
- A new Row is constructed containing a single column named “moment” with the determined moment.
- A Value is created from the moment.
- The Row is wrapped in the Ok variant of the Result.
Building and packaging the module:
- Compile the Rust library with cargo, targeting wasm32-wasi (renamed wasm32-wasip1 in recent Rust releases). The Rust programming language supports WebAssembly as a compilation target. The WASI target is integrated into the standard library and is intended for producing standalone binaries.
- Package the wasm module along with the REEF binary or, better, use the Punch platform, which provides advanced packaging and deployment for versioned functions.
It is worth noting that REEF functions take rows of data as input. In this case, the input row contains a single column with a timestamp; the function determines the moment of the day from that timestamp and produces a new row containing the calculated moment.
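The logic described above can be sketched with the Rust standard library alone. This is an illustration under stated assumptions, not the original REEF function: `Row` is a hypothetical stand-in for the REEF row type, the input column is assumed to be named "timestamp" and to hold Unix seconds as text, the hour and minute are derived by plain arithmetic rather than with chrono, the `reef_bindgen` attribute is omitted, and the thresholds separating night from day are arbitrary choices.

```rust
// Hypothetical stand-in for a REEF row: ordered (column, value) pairs.
type Row = Vec<(String, String)>;

/// Arbitrary classification (an assumption): exactly 00:00 is "midnight",
/// exactly 12:00 is "midday", 18:00 to 05:59 is "night", the rest is "day".
fn classify(hour: u32, minute: u32) -> &'static str {
    match (hour, minute) {
        (0, 0) => "midnight",
        (12, 0) => "midday",
        (h, _) if h < 6 || h >= 18 => "night",
        _ => "day",
    }
}

/// Reads the "timestamp" column (Unix seconds, UTC) and returns a new
/// row with a single "moment" column.
fn moment(row: Row) -> Result<Row, String> {
    let raw = row
        .iter()
        .find(|(name, _)| name == "timestamp")
        .map(|(_, value)| value.as_str())
        .ok_or_else(|| "missing column: timestamp".to_string())?;

    let timestamp = raw
        .parse::<i64>()
        .map_err(|e| format!("invalid timestamp '{}': {}", raw, e))?;

    // Seconds elapsed since the start of the UTC day; rem_euclid keeps
    // the result non-negative even for timestamps before 1970.
    let seconds_of_day = timestamp.rem_euclid(86_400);
    let hour = (seconds_of_day / 3_600) as u32;
    let minute = ((seconds_of_day % 3_600) / 60) as u32;

    Ok(vec![(
        "moment".to_string(),
        classify(hour, minute).to_string(),
    )])
}
```

Returning `Result<Row, String>` mirrors the signature discussed above: a malformed or missing timestamp becomes an error value that the engine can handle, instead of a panic inside the sandbox.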
Why rows? Because all Punch pipelines expose standard and well-known (SQL) concepts of tables, rows, and columns. These are widely used, are easy to understand, and allow additional services to be implemented, such as data lineage, SQL manipulation, schema sharing, etc.
REEF Use Cases
There are countless potential usages of such a Rust/WebAssembly function engine. Here are the ones we currently work on:
Carbon Calculation Platform.
The first use case, Thalc, consists of collecting electricity measures from various data centers and forwarding the data to a central platform hosted on the Google Cloud Platform. REEF is used here as a lightweight agent to grab data from servers (where it is pre-installed) or from applications (using its HTTP poller source). It can thus act as an agent or as a gateway. In both cases, it forwards the data using its HTTP sink to a platform where carbon and energy costs are further computed and exposed to the customers.
In the second use case, REEF runs on small devices to capture bird songs (audio signals), apply TensorFlow Lite song identification, and forward the results to a central platform where the findings can be further processed and visualized. This is, in a way, a re-implementation of the well-known BirdNET application using Rust and WebAssembly technologies. We also explore the deployment of such applications onto RISC-V hardware.
A complete and separate blog describes this interesting experiment.
Log Management and CyberSecurity Platforms
Lastly, we plan to use REEF as a unique processing engine for Punch-powered log management platforms to keep reducing their overall footprints. Today, a single Java Punch log processor can handle several tens of thousands of logs per second. The processing includes log autodiscovery, parsing, enrichment, normalization, error processing, and indexing. We do not expect to beat these numbers using Rust. We have already shown that this Java engine performs similarly to Vector, a comparable engine implemented in Rust.
However, REEF can reduce the required memory and CPU resources, significantly reducing the underlying (Kubernetes) platform size. This footprint reduction makes it easier to deploy small remote log collectors on a single server or a few servers, on top of lightweight Kubernetes instances, or, lighter still, directly as a standalone application that provides remote management facilities. Think of updating parsers packaged as WebAssembly functions.
Looking Ahead: A Roadmap to Uncharted Horizons
Our exploration into efficient data processing and sustainability is just starting. We are embarking on a new and exciting chapter as we forge partnerships with academics where innovation and biodiversity conservation intersect.
We are collaborating to develop an innovative biodiversity monitoring solution. This comprehensive solution encompasses end-to-end signal capture using intermittent microcomputers, machine learning-based species identification, and data transmission to a central platform. There, Punch showcases its value, facilitating the development of data algorithms through its fully-packaged Jupyter component and empowering the creation of insightful biodiversity dashboards.
The challenge we face is minimizing the energy consumption of the deployed sensors while ensuring that they operate efficiently. By leveraging REEF, we aim to maximize their battery life and minimize their environmental impact.
This endeavor poses multifaceted challenges. Fleet management and deployment of sensors, cross-compilation of code for compatibility across diverse hardware, and ensuring the seamless integration of REEF into our solution are just a few aspects we are currently addressing.
Conclusion: Crafting the Future of Data Processing
Our solution’s technical pas de trois of frugality, safety, and performance resonates across the industry, from environmental monitoring to sustainable energy management, embedded systems, and wildlife research. Through our dedication to collaboration and open-source principles, we invite developers, researchers, and enthusiasts to join us in this exciting journey.
Stay tuned for more exciting updates and innovations in the coming months!
As always, thanks to the team!