Doing that is useful to quickly prototype, to understand what you are doing, or simply to learn the Spark basics. Because the standalone punchplatform comes automatically equipped with a Spark runtime, ready-to-use data injectors, Elasticsearch components and Kibana dashboards, it could not be simpler to start playing with Spark.
Make sure you have:
- a decent Unix or macOS laptop
- Java (JDK) 1.8 or higher
- Maven 3
- a standalone punchplatform
If not already done, install and start your punchplatform with the Spark runtime enabled. That only takes a few seconds.
[~]$ unzip punchplatform-standalone-4.0.0.zip
[~]$ cd punchplatform-standalone-4.0.0
[standalone]$ ./install.sh -s --with-spark
.... <skipped> ...
[standalone]$ source ~/.bashrc
[standalone]$ punchplatform-admin.sh --start
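To check that the Spark standalone cluster is up, you can look at the Spark master web UI, which by default listens on port 8080 (the submission port being 7077, as used below). The exact port may differ depending on your punchplatform configuration; a quick sanity check from the shell:

[standalone]$ curl -s http://localhost:8080 | head -n 5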
Next, create a new Maven project:
[~]$ mvn archetype:generate -DgroupId=com.example -DartifactId=spark-example -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
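The quickstart archetype generates a standard Maven layout, roughly:

spark-example/
  pom.xml
  src/main/java/com/example/App.java
  src/test/java/com/example/AppTest.java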
In there you have the startup com.example.App class. Overwrite it with the following code. It simply takes a plain text file as input and counts the number of lines containing the character 'a' or 'b'.
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("ERROR: missing input file path parameter");
            System.exit(1);
        }
        String filepath = args[0];

        // Obtain a session; the master URL is provided at submit time.
        SparkSession spark = SparkSession
            .builder()
            .appName("Punch Spark Example")
            .getOrCreate();

        // Read the file as a dataset of lines. It is cached so that the
        // two filter/count passes below do not re-read it from disk.
        Dataset<String> fileData = spark.read().textFile(filepath).cache();

        long numAs = fileData.filter(s -> s.contains("a")).count();
        long numBs = fileData.filter(s -> s.contains("b")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        spark.stop();
    }
}
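Note that the class lives in the com.example package, matching the groupId given to the archetype, so it must be referenced as com.example.App when submitting the job below.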
Also edit the project's pom.xml as follows. Besides the Spark SQL dependency, make sure the compiler targets Java 8, which the lambda expressions above require.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>spark-example</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>spark-example</name>
  <url>http://maven.apache.org</url>

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <dependencies>
    <!-- Spark dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
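One optional refinement, shown here only as a sketch: since spark-submit already puts the Spark libraries on the classpath at runtime, you can mark the Spark dependency as provided to keep it out of your packaged jar.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.0</version>
  <!-- supplied by spark-submit at runtime, not packaged in the jar -->
  <scope>provided</scope>
</dependency>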
You are ready to go: build your app.
[spark-example]$ mvn clean package
And submit it to the punchplatform Spark cluster:
[~]$ cd $PUNCHPLATFORM_CONF_DIR/../external/spark-2.2.0-bin-hadoop2.7
[spark-2.2.0-bin-hadoop2.7]$ ./bin/spark-submit \
    --class com.example.App \
    --master spark://localhost:7077 \
    ~/spark-example/target/spark-example-1.0-SNAPSHOT.jar /path/to/anytextfile
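If you just want to check the application without going through the cluster, spark-submit can also run it in local mode. The sample file below is only an illustration:

[spark-2.2.0-bin-hadoop2.7]$ printf "apple\nbanana\ncherry\n" > /tmp/sample.txt
[spark-2.2.0-bin-hadoop2.7]$ ./bin/spark-submit \
    --class com.example.App \
    --master "local[*]" \
    ~/spark-example/target/spark-example-1.0-SNAPSHOT.jar /tmp/sample.txt

Among the Spark logs you should find the line printed by the application, in this case "Lines with a: 2, lines with b: 1".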