The punchplatform lets you design arbitrary Spark pipelines on top of your Elastic data through simple configuration; refer to the PML documentation for that. In this post we explain how to execute a plain Spark job.

Doing that is useful to quickly prototype, to understand what you are doing, or simply to learn the Spark basics. Because the standalone punchplatform comes equipped with a Spark runtime, ready-to-use data injectors, Elastic components and Kibana dashboards, it could not be simpler to start playing with Spark.

Make sure you have:

  • a decent Unix or macOS laptop
  • Java (JDK) 1.8 or higher
  • Maven 3
  • a standalone punchplatform

If not already done, install your punchplatform with the Spark runtime and start it. That only takes a few seconds.


[~]$ unzip punchplatform-standalone-4.0.0.zip
[~]$ cd punchplatform-standalone-4.0.0
[standalone]$ ./install.sh -s --with-spark
   .... <skipped> ...
[standalone]$ source ~/.bashrc
[standalone]$ punchplatform-admin.sh --start

Next, create a new Maven project:

[~]$ mvn archetype:generate -DgroupId=com.example -DartifactId=spark-example -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
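
The pom.xml generated by the archetype knows nothing about Spark, so the code below will not compile until you add the spark-sql dependency and target Java 8 (the lambda syntax requires it). Here is a minimal sketch; the _2.11 Scala suffix and the 2.4.3 version are assumptions, align them with the Spark runtime shipped in your standalone:

<properties>
  <!-- the lambdas in the example require at least Java 8 -->
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <!-- assumption: match this to your standalone Spark version -->
    <version>2.4.3</version>
    <!-- provided scope: spark-submit supplies Spark at runtime -->
    <scope>provided</scope>
  </dependency>
</dependencies>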

In there you have the startup com.example.App class. Overwrite it with the following code. It simply takes a plain text file as input and counts the number of lines containing the character 'a' or 'b'.

package com.example;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class App {
  public static void main(String[] args) {

    if (args.length < 1) {
      System.out.println("ERROR: missing input file path parameter");
      System.exit(1);
    }

    String filepath = args[0];

    // The SparkSession is the entry point of a Spark application.
    // Master and deployment settings are left to spark-submit.
    SparkSession spark = SparkSession
        .builder()
        .appName("Punch Spark Example")
        .getOrCreate();

    // Read the input file as a dataset of lines; cache it because
    // it is traversed twice below.
    Dataset<String> fileData = spark.read().textFile(filepath).cache();

    // The FilterFunction cast disambiguates the Java filter() overloads.
    long numAs = fileData.filter((FilterFunction<String>) s -> s.contains("a")).count();
    long numBs = fileData.filter((FilterFunction<String>) s -> s.contains("b")).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    spark.stop();
  }
}
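
To see it run, package the project and submit the resulting jar. The sequence below is a sketch: it assumes spark-submit is on your PATH once the standalone is installed, uses the default 1.0-SNAPSHOT jar name produced by the quickstart archetype, and takes /etc/hosts as an arbitrary input file. The --master local[2] flag runs the job locally with two threads, which is plenty for this example.

[~]$ cd spark-example
[spark-example]$ mvn clean package
[spark-example]$ spark-submit --class com.example.App --master local[2] \
    target/spark-example-1.0-SNAPSHOT.jar /etc/hosts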