— Next episode of the /usr/bin/grep saga —

The unavoidable log-aggregation saga on Hortonworks, MapR and myDistrib 2.6.X continues: #1@2015 (2.7.1), #2@2016 (2.7.2), #2.1@2016.


Milestones

&(ElasticSearch->grep) = &(*ElasticSearch.grep) ?


Spark cluster ready.


From oldschool grep to org.apache.hadoop.examples.Grep (grep 2.0) to xgrep.py (grep 2.1).


MapReduce grep 2.0, already oldschool on the BigData timescale?

 
hadoop org.apache.hadoop.examples.Grep hdfs://vm01.jbdata.fr:9000/logstash/2016-05-12 hdfs://vm01.jbdata.fr:9000/_output "INFO"
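Under the hood, org.apache.hadoop.examples.Grep chains two jobs: a grep job counting regex matches, then a sort job ordering them by frequency. A minimal sketch of the same computation in the grep 2.1 idiom (PySpark; the names and argument handling are mine, not the example's):

import re
import sys

from pyspark import SparkContext

# Sketch: count regex matches per matched string, then rank by frequency,
# mirroring what org.apache.hadoop.examples.Grep computes in two MR jobs.
sc = SparkContext(appName="grep20-sketch")
matches = sc.textFile(sys.argv[1]).flatMap(lambda line: re.findall(sys.argv[2], line))
counts = matches.map(lambda m: (m, 1)).reduceByKey(lambda a, b: a + b)
for word, n in counts.sortBy(lambda kv: kv[1], ascending=False).take(10):
  print("{} {}".format(n, word))
sc.stop()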
 

Spark-Python grep 2.1, soon to be oldschool on the roadmap?

 
/usr/local/spark/bin/spark-submit /data31tech/dev/spark-00/xgrep.py /test/lstash-00/2016-06-10/hadoop-hduser-datanode-vm01.log ERROR
 

1st level, SparkContext release.

#
# @author JBD-2016-07
# http://data31tech.fr
#
 
from __future__ import print_function
 
import sys
 
from pyspark import SparkContext
 
if __name__ == "__main__":
  if len(sys.argv) != 3:
    print("Usage: xgrep file pattern", file=sys.stderr)
    sys.exit(-1)
 
  # Classic RDD route: read the file, keep only the lines containing the pattern
  aSparkContext = SparkContext(appName="xgrep@JBD")
  aFile = aSparkContext.textFile(sys.argv[1])
  theErrors = aFile.filter(lambda line, pattern=sys.argv[2]: pattern in line)
  print("Results#SparkContext:")
  print(theErrors)  # prints the RDD object itself, not the matching lines
  print("Number of {} : {}".format(sys.argv[2], theErrors.count()))
  # for anItem in theErrors.collect():
  #   print(anItem)
  aSparkContext.stop()
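
collect() pulls every matching line back to the driver, which can hurt on real log volumes; take(n) previews a bounded sample instead. A minimal follow-up, reusing theErrors from the script above:

# Preview a bounded sample of matches instead of collecting everything
for aLine in theErrors.take(10):
  print(aLine)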

2nd level, SparkSession release. 3rd level, RDD for another CaseStudy#*.

#
# @author JBD-2016-07
# http://data31tech.fr
#
 
from __future__ import print_function
 
import sys
 
from pyspark.sql import SparkSession
 
if __name__ == "__main__":
  if len(sys.argv) != 3:
    print("Usage: xgrep file pattern", file=sys.stderr)
    sys.exit(-1)
 
  # SparkSession/DataFrame route: read the file as a single "value" column
  aSparkSQLSession = SparkSession.builder.appName("xgrep@JBD").getOrCreate()
  theLines = aSparkSQLSession.read.text(sys.argv[1])
  # Substring match on the pattern passed on the command line
  theErrors = theLines.filter("value like '%{0}%'".format(sys.argv[2]))
  print("Results#SparkSQLSession:")
  print(theLines)   # prints the DataFrame schema, not the rows
  print(theErrors)
  theLines.show()   # first 20 rows, truncated
  theErrors.show()
  aSparkSQLSession.stop()
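
The LIKE filter above is a plain substring match; for grep-style regular expressions, the DataFrame column API offers rlike. A minimal sketch reusing theLines from the script (the variable theRegexErrors is mine):

from pyspark.sql.functions import col

# Regex matching instead of a substring LIKE; the pattern still comes from argv
theRegexErrors = theLines.filter(col("value").rlike(sys.argv[2]))
theRegexErrors.show(10, truncate=False)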

From ASM, C/C++, SDK32/MFC, Java, 2D/3D, PHP, Android, HTML5 to Pig Latin, JRuby, Python, by way of Shell(s), Perl, IDL, OCCAM, LISP, PROLOG, APT, TODO#?

TODOs

The unavoidable log-aggregation saga on Hortonworks, MapR and myDistrib 2.6.X: DONE#1@2015, DONE#2@2016, DONE#2.1@2016, TODO#3.