The way to the best public transport "coverage" point?
Processing Toulouse OpenData (bus, metro, tramway, bike) with Pig, Neo4j, and Elasticsearch.

Pig environment.
REGISTER /data31tech/elephant-bird-4.9/lib/json-simple-1.1.1.jar;
REGISTER /data31tech/elephant-bird-4.9/lib/elephant-bird-core-4.9.jar;
REGISTER /data31tech/elephant-bird-4.9/lib/elephant-bird-pig-4.9.jar;
REGISTER /data31tech/dev/data31tech-0.1.jar;
bus = LOAD '/data31tech/input/arrets-de-bus.JBD.geojson' USING fr.data31tech.pig.opendata.tisseo.JsonLoader('-nested');
Recomputing and restructuring the data with Pig.
bus_0 = FOREACH bus GENERATE FLATTEN((bag{tuple(tuple(chararray,chararray,chararray,bag{... );
bus_1 = FOREACH bus_0 GENERATE FLATTEN($0);
bus_2 = FOREACH bus_1 GENERATE $2, $8, STRSPLIT(BagToString($3), '_');
bus_3 = FOREACH bus_2 GENERATE $0, $1, (float)$2.$0, (float)$2.$1;
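bus_4 = FOREACH bus_3 GENERATE * AS (nom:chararray, ligne:chararray, longitude:float, latitude:float);
-- Note: in the sample output the values near 43.6 appear under longitude and those near 1.4 under latitude.
-- For Toulouse these look like latitude and longitude respectively, so the two field names are probably swapped.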
...
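To make the bus_2 and bus_3 steps more concrete: BagToString joins the values of the coordinate bag with its default '_' delimiter, and STRSPLIT cuts the resulting string back into separate fields. Here is a minimal shell sketch of that round trip, assuming the bag holds the two floats seen in the sample output. The function name split_coords is hypothetical and only serves the illustration.

```bash
#!/bin/bash
# Hypothetical helper mimicking STRSPLIT(BagToString($3), '_') from the Pig script.
# Input: two coordinates joined with '_', e.g. "43.635036_1.4614924".
# Output: the two values, tab-separated, in their original order.
split_coords() {
  local joined="$1" first second
  # Split on '_' only; -r keeps any backslash literal.
  IFS='_' read -r first second <<< "$joined"
  printf '%s\t%s\n' "$first" "$second"
}

split_coords "43.635036_1.4614924"
```

The order of the two values is preserved by the split, so whichever coordinate comes first in the source bag also comes first in the final relation.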
Output.
grunt> describe bus_4
bus_4: {nom: chararray,ligne: chararray,longitude: float,latitude: float}
grunt> illustrate bus_4
...
-------------------------------------------------------------------------------------------------------------------
| bus_4 | nom:chararray | ligne:chararray | longitude:float | latitude:float |
-------------------------------------------------------------------------------------------------------------------
| | Cabarette | 19 | 43.635036 | 1.4614924 |
| | Cadène | 59 | 43.653927 | 1.4233674 |
| | Calicéo | 43 | 43.649242 | 1.504343 |
...
grunt> explain bus_4
...
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-157
Map Plan
bus_4: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-156
|
|---bus_4: New For Each(false,false,false,false)[bag] - scope-155
| |
| Project[chararray][0] - scope-143
...
|
|---bus_2: New For Each(false,false,false)[bag] - scope-142
| |
| Project[chararray][2] - scope-133
...
|
|---bus_1: New For Each(true)[bag] - scope-132
| |
| Project[tuple][0] - scope-130
...
|
|---bus_0: New For Each(true)[bag] - scope-129
| |
| Cast[bag:{((chararray,chararray,chararray,{(float)},chararray,chararray,chararray,float,...
| |
| |---Project[bag][2] - scope-126
| |
| |---Project[bag][1] - scope-125
|
|---bus: Load(...) - scope-124--------
Global sort: false
----------------
A bit of good old homegrown shell for seasoned data scientists.
#!/bin/bash
input="${1:-/dev/stdin}"
#output="${2:-/dev/stdout}"
#error="${3:-/dev/stderr}"
if [ -n "$1" ] && [ ! -f "$1" ]
then
echo "ERROR: input is not a file" >&2
exit 1
fi
# Progress message goes to stderr so it does not end up in the generated statements.
echo "Transforming [$input] / tramway(nom,ligne)|bus(nom,ligne)|metro(nom,ligne)|velo(nom,rue)" >&2
aIndice=0
while IFS= read -r aLine
do
aIndice=$((aIndice + 1))
aLine=`echo $aLine|sed -r 's/^(\{\"nom\"\:\"(.*)\"\,\"rue\".*)$/\(\2\:velo \1\)\;/g'|sed -r 's/\"nom\"/nom.../g'`
aString1=`echo $aLine|cut -f 1 -d ':'|sed -r 's/[ ]+/_/g'|sed -r 's/[.]+/_/g'...'s/^.([0-9]+.*)$/\(_\1/g'.../g'`
aString2=`echo "$aLine"|cut -f 2- -d ':'`
echo "CREATE ${aString1}:${aString2}"
# Read from the file given as first argument, or from stdin by default.
done < "$input"
...
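Since parts of the script are elided, here is a minimal sketch of the idea it follows: clean the station name so it can serve as a node identifier, then emit one CREATE statement per record. The substitutions mirror the ones visible in the script (spaces and dots become underscores, a leading digit gets an underscore prefix). The function names sanitize_identifier and to_create, the (nom, ligne) argument layout, and the "3 Stations" name are assumptions made for the example, not the author's code.

```bash
#!/bin/bash
# Hypothetical helper: make a station name usable as a node identifier,
# following the substitutions visible in the script above.
sanitize_identifier() {
  local name="$1"
  # Runs of spaces or dots become a single underscore.
  name=$(printf '%s' "$name" | sed -E 's/[ ]+/_/g; s/[.]+/_/g')
  # A leading digit gets an underscore prefix.
  case "$name" in
    [0-9]*) name="_$name" ;;
  esac
  printf '%s\n' "$name"
}

# Hypothetical helper: emit one CREATE statement for a (nom, ligne) pair.
# Assumes neither value contains a double quote.
to_create() {
  local label="$1" nom="$2" ligne="$3"
  printf 'CREATE (%s:%s {nom:"%s", ligne:"%s"});\n' \
    "$(sanitize_identifier "$nom")" "$label" "$nom" "$ligne"
}

to_create bus "Cabarette" "19"   # name and line taken from the sample output
to_create bus "3 Stations" "1"   # made-up name, shows the leading-digit rule
```

Names containing hyphens, apostrophes, or quotes would need further handling. This sketch only covers the cases the original substitutions address.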
Loading into Neo4j.
hdfs dfs -cat /data31tech/output/velo_gen/part-m-00000 | /data31tech/dev/pig2neo.sh > /data31tech/dev/velo.insert.n4j
$NEO4J_HOME/bin/neo4j-shell -file /data31tech/dev/velo.insert.n4j
...
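Everything the conversion script writes to standard output ends up in the .n4j file, so a quick sanity check before handing the file to neo4j-shell can save a failed load. The sketch below assumes that every statement sits on one line, starts with "CREATE " and ends with ";". The function name check_inserts is hypothetical.

```bash
#!/bin/bash
# Hypothetical check: succeed only if every non-empty line of the file
# starts with "CREATE " and ends with ";".
check_inserts() {
  local file="$1" bad
  [ -f "$file" ] || { echo "ERROR: $file is not a file" >&2; return 1; }
  # grep -v selects the lines that do NOT match the pattern; -c counts them.
  bad=$(grep -v -E -c '^(CREATE .*;)?$' "$file")
  if [ "$bad" -ne 0 ]; then
    echo "ERROR: $bad unexpected line(s) in $file" >&2
    return 1
  fi
}
```

A possible use is to chain it with the load, for instance check_inserts /data31tech/dev/velo.insert.n4j && $NEO4J_HOME/bin/neo4j-shell -file /data31tech/dev/velo.insert.n4j, so that the shell is only started on a file that looks clean.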
Loading into Elasticsearch.
REGISTER /data31tech/elasticsearch-hadoop-2.1.1/dist/elasticsearch-hadoop-pig-2.1.1.jar;
STORE metro_4 INTO 'metro/json' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=localhost:9200');
...
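One way to confirm that the STORE wrote something is to ask Elasticsearch for a document count on the same resource. The sketch below only builds the URL. Its defaults mirror the es.nodes value and the 'metro/json' resource used above. The function name es_count_url is hypothetical, and the index/type form of the _count endpoint is assumed to match the Elasticsearch version in use at the time (versions without mapping types would drop the type segment).

```bash
#!/bin/bash
# Hypothetical helper: build the _count URL for a given node and resource.
# Defaults mirror the es.nodes and 'metro/json' values used in the STORE.
es_count_url() {
  local nodes="${1:-localhost:9200}" resource="${2:-metro/json}"
  printf 'http://%s/%s/_count\n' "$nodes" "$resource"
}

# Needs curl and a running Elasticsearch node, so it is left commented out:
# curl -s "$(es_count_url)"
es_count_url
```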
The excellent Pig Latin, the last language?
Sadly, there are problems with complex data structures. Addressing JSON gets an old-school level of service, in the BeanUtil / Velocity vein...
JSON, the new XML. It is the focus of all the attention: Hortonworks (a first bad experience with Hue), PiggyBank, Elephant Bird, and fr.data31tech.pig.opendata.tisseo.JsonLoader.
!! Despite all the $0.$0.$0, you never get inside the (etat,nom...) tuple ??
metro_0 = FOREACH metro GENERATE features.$2.$0.$0.$0.$0.$0.$0;
describe metro_0
metro_0: {{(properties: (etat: chararray,nom: chararray,ligne: chararray,geo_point_2d: {(value: float)}))}}
!! The cast works, the data structure is valid, yet the error message is a little surreal ??
grunt> metro_0 = FOREACH metro GENERATE (bag{tuple(tuple(chararray,chararray,chararray,bag{tuple(float)}))})features...
2015-07-22 11:42:52,970 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1128: Cannot find field nom in properties:tuple...
Two excellent links: Pig, Neo4j.
The exploration continues toward the best public transport "coverage" point.