Overview

Which way to the best public transport "coverage" point?
Processing OpenData datasets with Pig, Neo4j and Elasticsearch: bus, metro, tramway and bike data for Toulouse.

Milestones

Pig environment.

REGISTER /data31tech/elephant-bird-4.9/lib/json-simple-1.1.1.jar;
REGISTER /data31tech/elephant-bird-4.9/lib/elephant-bird-core-4.9.jar;
REGISTER /data31tech/elephant-bird-4.9/lib/elephant-bird-pig-4.9.jar;
REGISTER /data31tech/dev/data31tech-0.1.jar;
 
bus = LOAD '/data31tech/input/arrets-de-bus.JBD.geojson' USING fr.data31tech.pig.opendata.tisseo.JsonLoader('-nested');

Data recomputing, smart refactoring with Pig.

bus_0 = FOREACH bus GENERATE FLATTEN((bag{tuple(tuple(chararray,chararray,chararray,bag{...  );
bus_1 = FOREACH bus_0 GENERATE FLATTEN($0);
bus_2 = FOREACH bus_1 GENERATE $2, $8, STRSPLIT(BagToString($3), '_');
bus_3 = FOREACH bus_2 GENERATE $0, $1, (float)$2.$0, (float)$2.$1;
bus_4 = FOREACH bus_3 GENERATE * AS (nom:chararray, ligne:chararray, longitude:float, latitude:float);
...
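To make the `bus_2` and `bus_3` steps more concrete: `BagToString` joins the elements of the coordinates bag with its default `_` delimiter, and `STRSPLIT(..., '_')` splits the result back into a tuple whose two fields are then cast to float. The shell sketch below mimics that round trip on one coordinate pair taken from the `illustrate` output. The function name `split_coords` is hypothetical and is not part of the project.

```shell
# Hypothetical helper mimicking STRSPLIT(BagToString($3), '_') on a single value.
split_coords() {
  joined="$1"
  first="${joined%%_*}"   # text before the first underscore
  second="${joined#*_}"   # text after the first underscore
  printf '%s %s\n' "$first" "$second"
}

split_coords "43.635036_1.4614924"
```

In Pig the two fields come back as chararray, which is why `bus_3` casts them to float explicitly.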

Output.

grunt> describe bus_4
bus_4: {nom: chararray,ligne: chararray,longitude: float,latitude: float}
 
grunt> illustrate bus_4
...
-------------------------------------------------------------------------------------------------------------------
| bus_4     | nom:chararray              | ligne:chararray             | longitude:float     | latitude:float     |
-------------------------------------------------------------------------------------------------------------------
|           | Cabarette                  | 19                          | 43.635036           | 1.4614924          |
|           | Cadène                     | 59                          | 43.653927           | 1.4233674          |
|           | Calicéo                    | 43                          | 43.649242           | 1.504343           |
...
 
grunt> explain bus_4
...
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-157
Map Plan
bus_4: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-156
|
|---bus_4: New For Each(false,false,false,false)[bag] - scope-155
    |   |
    |   Project[chararray][0] - scope-143
    ...
    |
    |---bus_2: New For Each(false,false,false)[bag] - scope-142
        |   |
        |   Project[chararray][2] - scope-133
        ...
        |
        |---bus_1: New For Each(true)[bag] - scope-132
            |   |
            |   Project[tuple][0] - scope-130
            ...
            |
            |---bus_0: New For Each(true)[bag] - scope-129
                |   |
                |   Cast[bag:{((chararray,chararray,chararray,{(float)},chararray,chararray,chararray,float,...
                |   |
                |   |---Project[bag][2] - scope-126
                |       |
                |       |---Project[bag][1] - scope-125
                |
                |---bus: Load(...) - scope-124--------
Global sort: false
----------------

A bit of homegrown shell for seasoned data scientists.

#!/bin/bash
 
input="${1:-/dev/stdin}"
#output="${2:-/dev/stdout}"
#error="${3:-/dev/stderr}"
 
# Reject an explicit argument that is not a regular file
if [ -n "$1" ] && [ ! -f "$1" ]
then
  echo "ERROR: input is not a file" >&2
  exit 1
fi
 
# The banner goes to stderr so it does not end up in the generated statements
echo "Transforming [$input] / tramway(nom,ligne)|bus(nom,ligne)|metro(nom,ligne)|velo(nom,rue)" >&2
 
aIndice=0
while read aLine;
do
  aIndice=$((aIndice + 1))
  aLine=`echo $aLine|sed -r 's/^(\{\"nom\"\:\"(.*)\"\,\"rue\".*)$/\(\2\:velo \1\)\;/g'|sed -r 's/\"nom\"/nom.../g'`
  aString1=`echo $aLine|cut -f 1 -d ':'|sed -r 's/[ ]+/_/g'|sed -r 's/[.]+/_/g'...'s/^.([0-9]+.*)$/\(_\1/g'.../g'`
  aString2=`echo $aLine|cut -f 2- -d ':'`
  echo "CREATE "$aString1:$aString2
done < "$input"
...
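The `aString1` pipeline above turns a station name into a token that is safe to use as a Cypher identifier. As a rough sketch of the visible steps only (the elided parts of the original `sed` chain are not reproduced), the helper below collapses runs of spaces and dots into underscores and prefixes a leading digit with an underscore. The function name `to_identifier` and the sample names are assumptions made for illustration.

```shell
# Hypothetical helper: normalize a station name into an identifier-like token.
to_identifier() {
  printf '%s\n' "$1" |
    sed -E -e 's/[ ]+/_/g' -e 's/[.]+/_/g' -e 's/^([0-9])/_\1/'
}

to_identifier "3 Cocus"            # made-up input starting with a digit
to_identifier "Palais de Justice"  # made-up input containing spaces
```

`sed -E` is used here instead of `sed -r` because it is accepted by both GNU and BSD `sed`.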

Loading into Neo4j.

hdfs dfs -cat /data31tech/output/velo_gen/part-m-00000 | /data31tech/dev/pig2neo.sh > /data31tech/dev/velo.insert.n4j
$NEO4J_HOME/bin/neo4j-shell -file /data31tech/dev/velo.insert.n4j
...
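Before handing the generated file to `neo4j-shell`, it can be worth checking that it contains the expected number of statements. The sketch below counts the lines that start with `CREATE `, which is the prefix the conversion script emits. The function name `count_creates` is hypothetical.

```shell
# Hypothetical check: count the CREATE statements in a generated file.
count_creates() {
  grep -c '^CREATE ' "$1"
}

# Example usage (path taken from the commands above):
#   count_creates /data31tech/dev/velo.insert.n4j
```

Comparing this count with the number of records in the Pig output gives a quick indication that no line was dropped during the conversion.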


Loading into Elasticsearch.

REGISTER /data31tech/elasticsearch-hadoop-2.1.1/dist/elasticsearch-hadoop-pig-2.1.1.jar;
STORE metro_4 INTO 'metro/json' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=localhost:9200');
...

TODOs

The excellent Pig Latin: the last language?
Unfortunately, it struggles with complex data structures. Addressing JSON gets an old-school, BeanUtils / Velocity level of service...
JSON, the new XML. The focus of all the attention: Hortonworks (a first bad experience, Hue), PiggyBank, Elephant Bird and fr.data31tech.pig.opendata.tisseo.JsonLoader.

!! Despite all the $0.$0.$0, you never get inside the tuple (etat,nom...) ??
metro_0 = FOREACH metro GENERATE features.$2.$0.$0.$0.$0.$0.$0;
describe metro_0
metro_0: {{(properties: (etat: chararray,nom: chararray,ligne: chararray,geo_point_2d: {(value: float)}))}}
 
!! The cast works and the data structure is valid, yet the error message is a bit surreal ??
grunt> metro_0 = FOREACH metro GENERATE (bag{tuple(tuple(chararray,chararray,chararray,bag{tuple(float)}))})features...
2015-07-22 11:42:52,970 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1128: Cannot find field nom in properties:tuple...

Two excellent links: Pig, Neo4j.

The exploration continues toward the best public transport "coverage" point.