Doing that is useful to prototype quickly, to understand what you are doing, or simply to learn the Spark basics. Because the standalone punchplatform comes equipped with a Spark runtime, ready-to-use data injectors, Elastic components and Kibana dashboards, getting started with Spark could hardly be simpler.
Make sure you have:
- a decent Unix or macOS laptop
- Java (JDK) 1.8 or higher
- Maven 3
- a standalone punchplatform.
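If you are unsure which Java version is on your path, a small check like the one below can help. It is only a sketch: the class name `JavaVersionCheck` and its `majorVersion` helper are hypothetical and not part of the punchplatform. It applies the "1.8 or higher" requirement exactly as stated in the list above.

```java
// Hypothetical helper, not part of the punchplatform:
// reports whether the running JVM is Java 8 (1.8) or newer.
public class JavaVersionCheck {

    // Converts a "java.specification.version" value such as "1.8", "11" or "17"
    // into a major version number.
    public static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Before Java 9 the value has the form "1.x", and the major version is x.
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        int major = majorVersion(spec);
        System.out.println("Java specification version: " + spec
                + (major >= 8 ? " (OK)" : " (too old, 1.8 or higher is required)"));
    }
}
```

Compile it with `javac JavaVersionCheck.java` and run it with `java JavaVersionCheck`, using the same JDK that Maven will use.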
If not already done, install and start your punchplatform with a Spark runtime. That takes a few seconds.
[~]$ unzip
[~]$ cd punchplatform-standalone-4.0.0
[standalone]$ ./ -s --with-spark
.... <skipped> ...
[standalone]$ source ~/.bashrc
[standalone]$ --start
Next, create a new Maven project:
[~]$ mvn archetype:generate -DgroupId=com.example -DartifactId=spark-example -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
In there you have the startup com.example.App class. Overwrite it with the following code. It simply takes a plain text file as input and counts the lines containing the character ‘a’ and the lines containing the character ‘b’.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("ERROR: missing input file path parameter");
            System.exit(1);
        }
        String filepath = args[0];

        // The master URL is not set here: it is provided by spark-submit.
        SparkSession spark = SparkSession
                .builder()
                .appName("Punch Spark Example")
                .getOrCreate();

        // Read the file as a Dataset of lines, cached because it is scanned twice.
        Dataset<String> fileData = spark.read().textFile(filepath).cache();

        long numAs = fileData.filter(s -> s.contains("a")).count();
        long numBs = fileData.filter(s -> s.contains("b")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        spark.stop();
    }
}
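To sanity-check the numbers that the Spark job prints, it can be handy to compute the same two counts without Spark on a small file. The following is a minimal sketch in plain Java. The class `LocalLineCount` and its `countContaining` method are hypothetical names introduced here for illustration; they are not part of the tutorial or of Spark.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: same counting logic as App, without any Spark dependency.
public class LocalLineCount {

    // Returns the number of lines that contain the given substring.
    public static long countContaining(List<String> lines, String needle) {
        return lines.stream().filter(s -> s.contains(needle)).count();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("ERROR: missing input file path parameter");
            System.exit(1);
        }
        // Loads the whole file into memory, so use it on small test files only.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println("Lines with a: " + countContaining(lines, "a")
                + ", lines with b: " + countContaining(lines, "b"));
    }
}
```

Run on the same input file, it should report the same counts as the Spark application, which makes it a quick reference when experimenting with other filters.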
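Then replace the generated pom.xml with the following one, which adds the Spark dependency.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>spark-example</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>spark-example</name>
  <url></url>
  <properties>
    <!-- The lambdas used in App require Java 8 source and target levels -->
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>
  <dependencies>
    <!-- Spark dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>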
You are ready to go: build your app.
[spark-example]$ mvn clean package
And submit it to the punchplatform Spark cluster:
[~]$ cd $PUNCHPLATFORM_CONF_DIR/../external/spark-2.2.0-bin-hadoop2.7
[spark-2.2.0-bin-hadoop2.7]$ ./bin/spark-submit --class "App" --master spark://localhost:7077 ~/spark-example/target/spark-example-1.0-SNAPSHOT.jar /path/to/anytextfile