Ability to design, build, and unit test applications on the Spark framework in Scala and Python.
Build Spark-based applications for both batch and streaming requirements, which calls for in-depth knowledge of the Hadoop ecosystem and of common NoSQL databases as well.
Develop and execute data pipeline testing processes, and validate business rules and policies.
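As a minimal sketch of validating business rules in a pipeline: keeping the rule as a plain function lets it be unit tested without a Spark cluster and later applied inside a Spark job (for example via a DataFrame filter or UDF). The `validate_order` function and its fields are hypothetical, invented here for illustration.

```python
def validate_order(order):
    """Return True when a record satisfies the (hypothetical) business rules:
    a positive amount and a recognized status. Illustrative only; the rule,
    field names, and allowed statuses are assumptions, not from the source."""
    return order.get("amount", 0) > 0 and order.get("status") in {"NEW", "SHIPPED"}

# Rule-level unit tests, runnable with plain assert or under pytest.
assert validate_order({"amount": 10.0, "status": "NEW"})
assert not validate_order({"amount": -5.0, "status": "NEW"})
assert not validate_order({"amount": 10.0, "status": "UNKNOWN"})
```

Because the rule is decoupled from Spark, the same function can be exercised in fast local tests and then reused unchanged inside the pipeline.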
Optimize the performance of Spark applications running on Hadoop by tuning configurations around SparkContext, Spark SQL, DataFrames, and pair RDDs.
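A sketch of submit-time tuning, using standard Spark configuration keys; the specific values and the script name `my_app.py` are illustrative assumptions, not recommendations for any particular cluster.

```shell
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.default.parallelism=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  my_app.py
```

`spark.sql.shuffle.partitions` governs shuffle parallelism for Spark SQL and DataFrame operations, while `spark.default.parallelism` covers RDD-level shuffles; Kryo serialization is a common swap-in for Java serialization when shuffling or caching large datasets.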
Optimize performance for data access requirements by choosing the appropriate Hadoop file format (Avro, Parquet, ORC, etc.) and compression codec for each workload.
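The choice above usually follows rules of thumb: columnar formats (Parquet, ORC) suit analytical scans over a few columns, while row-oriented Avro suits write-heavy ingestion and schema evolution; Snappy favors speed, gzip favors compression ratio. The `choose_format` helper and its access-pattern labels are hypothetical, invented here to make those rules concrete.

```python
def choose_format(access_pattern):
    """Map a (hypothetical) workload label to a (format, codec) pair,
    following common Hadoop rules of thumb. Illustrative only."""
    rules = {
        "columnar_scan": ("parquet", "snappy"),  # analytical reads of few columns
        "row_ingest": ("avro", "snappy"),        # write-heavy, schema-evolving feeds
        "archival": ("parquet", "gzip"),         # cold data, ratio over speed
    }
    if access_pattern not in rules:
        raise ValueError(f"unknown access pattern: {access_pattern}")
    return rules[access_pattern]

assert choose_format("columnar_scan") == ("parquet", "snappy")
assert choose_format("row_ingest") == ("avro", "snappy")
```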
Ability to design and build real-time applications using Apache Kafka and Spark Streaming.
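As a sketch, a streaming job that reads from Kafka is typically submitted with the Kafka connector on the classpath; the coordinates below assume a Spark 3.5 build against Scala 2.12, and `streaming_app.py` is a hypothetical script name.

```shell
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  streaming_app.py
```

The connector version must match the cluster's Spark and Scala versions, so treat the coordinate as a template rather than a fixed value.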