The new Elastic 5 GA is out. It brings many improvements and great new features. In this post we take a look at the new ingest nodes, which provide a way to deploy Logstash filters without the burden of deploying and running Logstash processes, each running an input-filter-output pipeline.
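As a quick reminder, an ingest pipeline is declared directly through the Elasticsearch REST API. Here is a minimal sketch of a pipeline applying a grok processor to apache access logs; the pipeline name and the source field are arbitrary choices for the example:

PUT _ingest/pipeline/apache-logs
{
  "description": "parse apache combined access logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{COMBINEDAPACHELOG}"]
      }
    }
  ]
}

An indexing request can then reference that pipeline (?pipeline=apache-logs) and the grok fields are extracted on the Elasticsearch side, with no Logstash process involved.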

Since ingest nodes look to us like a much better solution than Logstash, it is interesting to compare them with the punchplatform way of dealing with on-the-fly data transformation.

From day one we at the punchplatform decided not to use Logstash for data processing. Our view was, and still is, that managing clusters of Logstash instances is not a good architectural pattern (what about scalability? resiliency?), and that writing robust and modular log parsers with Logstash configuration files would turn out to be difficult. We were looking for a cleaner programming pattern, so that we and our users could write functions (and only functions), free from configuration considerations (where the data comes from, where it goes next), using a compact programming language so as to reduce verbosity to a minimum.

Hence the punch programming language, used to write small functions that are in turn deployed in distributed stream processing engines such as Storm or Spark. We decided to call these functions punchlets. The idea is: just like a servlet is deployed in an HTTP application server, a punchlet is deployed in a distributed big data platform. Here is a simple punchlet that parses some apache logs:

// apply the COMBINEDAPACHELOG grok pattern to the [logs][log] field,
// storing the extracted fields under [logs][grok]
if (!grok("%{COMBINEDAPACHELOG:[logs][grok]}").on([logs][log])) {
    raise("grok failed to parse apache log");
}

With that language you can do virtually everything: data transformation, filtering, enrichment. It is compiled on the fly and runs with excellent performance (it basically is a Java-compatible language). It supports all the well-known Logstash goodies such as key-value, csv, grok and dissect operators.
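For instance, extracting key-value pairs from a log line could look like the following sketch. The .on()/.into() chaining of the kv operator is assumed here by analogy with the grok example above; refer to the punch documentation for the exact signatures:

// sketch only: the kv operator chaining is assumed, not taken from the punch reference
if (!kv().on([logs][log]).into([logs][kv])) {
    raise("kv extraction failed");
}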

Equipped with punch, we needed a runtime engine. We picked Apache Storm and ended up with the Logstash pattern (input-filter-output) implemented using spouts and bolts. There we get scalability and end-to-end acknowledgement. Since then we have also made it possible to run punchlets in Spark: it lets us design Spark pipelines using configuration files, embedding a few lines of punch to take care of field selection, filtering or transformation.
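To make the spout/bolt wiring concrete, here is a minimal Apache Storm topology sketch in plain Java. The TopologyBuilder calls are the standard Storm API; KafkaInputSpout, PunchBolt and ElasticsearchOutputBolt are hypothetical placeholders standing in for the actual punchplatform components:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

// Input-filter-output pattern on Storm. The three components below are
// illustrative placeholders, not the real punchplatform classes.
public class LogParsingTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // input: read raw logs, for example from a Kafka topic
        builder.setSpout("input", new KafkaInputSpout("logs-topic"), 2);

        // filter: execute the apache punchlet on each tuple
        builder.setBolt("punchlet", new PunchBolt("apache.punch"), 4)
               .shuffleGrouping("input");

        // output: index the parsed documents into Elasticsearch
        builder.setBolt("output", new ElasticsearchOutputBolt("logs"), 2)
               .shuffleGrouping("punchlet");

        new LocalCluster().submitTopology("log-parsing", new Config(), builder.createTopology());
    }
}

Need more parsing power? Increase the parallelism hint of the punchlet bolt (the 4 above) or add more workers to the topology.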

Coming back to the architecture, there are two benefits. First, the data transformation pipeline is independent from Elastic. That turns out to be key in order to send the parsed and normalised data to third-party components such as real-time correlation engines, other big data lakes, etc. Somehow this argument is similar to having a dedicated Logstash tier. Second, it is much easier and more powerful to have that logic distributed and run in Storm: you need more power? You simply add more.

Since then, the punchplatform has grown to run other components, in particular Spark pipelines, and Kafka Streams applications are now appearing. Because punch is provided as a plain Java library, it is very easy to run a punchlet inside any (JVM-based) application. We realized it was very handy to add a few lines of punch to an analytics Spark pipeline to take care of feature or data transformation. Not only handy, but efficient too, because a punchlet runs at the speed of compiled Java bytecode.
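To give an idea of what embedding looks like, here is a hedged Java sketch. The Punchlet and Tuple class names and their methods are hypothetical illustrations of the pattern, not the actual punch library API:

// Hypothetical sketch of running a punchlet from a plain JVM application.
// Punchlet and Tuple are illustrative only; refer to the punch library
// documentation for the real classes and methods.
public class EmbeddedPunchExample {
    public static void main(String[] args) {
        // compile the punchlet source once, at application startup
        Punchlet punchlet = Punchlet.compile(
            "if (!grok(\"%{COMBINEDAPACHELOG:[logs][grok]}\").on([logs][log])) {"
          + "    raise(\"grok failed to parse apache log\");"
          + "}");

        // wrap an incoming raw log into a tuple and run the punchlet on it
        Tuple document = new Tuple();
        document.set("logs", "log",
            "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326");
        punchlet.execute(document);

        System.out.println(document);
    }
}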
