The way to the best public transport "coverage" point?
Pig, Neo4j and Elasticsearch manipulations of Toulouse OpenData: bus, metro, tramway and bike.
Pig environment.
REGISTER /data31tech/elephant-bird-4.9/lib/json-simple-1.1.1.jar;
REGISTER /data31tech/elephant-bird-4.9/lib/elephant-bird-core-4.9.jar;
REGISTER /data31tech/elephant-bird-4.9/lib/elephant-bird-pig-4.9.jar;
REGISTER /data31tech/dev/data31tech-0.1.jar;
bus = LOAD '/data31tech/input/arrets-de-bus.JBD.geojson' USING fr.data31tech.pig.opendata.tisseo.JsonLoader('-nested');
Data recomputing, smart refactoring with Pig.
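-- Cast the nested GeoJSON structure to an explicit schema, then flatten it twice.
bus_0 = FOREACH bus GENERATE FLATTEN((bag{tuple(tuple(chararray,chararray,chararray,bag{... );
bus_1 = FOREACH bus_0 GENERATE FLATTEN($0);
-- BagToString joins the bag's values with '_' by default, hence the split on '_'.
bus_2 = FOREACH bus_1 GENERATE $2, $8, STRSPLIT(BagToString($3), '_');
bus_3 = FOREACH bus_2 GENERATE $0, $1, (float)$2.$0, (float)$2.$1;
-- NOTE: the sample values shown by illustrate (about 43.6, then about 1.46) suggest
-- that the first coordinate is the latitude, so these two names may be swapped.
bus_4 = FOREACH bus_3 GENERATE * AS (nom:chararray, ligne:chararray, longitude:float, latitude:float);
...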
Output.
grunt> describe bus_4
bus_4: {nom: chararray,ligne: chararray,longitude: float,latitude: float}
grunt> illustrate bus_4
...
-------------------------------------------------------------------------------------------------------------------
| bus_4     | nom:chararray     | ligne:chararray     | longitude:float     | latitude:float     |
-------------------------------------------------------------------------------------------------------------------
|           | Cabarette         | 19                  | 43.635036           | 1.4614924          |
|           | Cadène            | 59                  | 43.653927           | 1.4233674          |
|           | Calicéo           | 43                  | 43.649242           | 1.504343           |
...
grunt> explain bus_4
...
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-157
Map Plan
bus_4: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-156
|
|---bus_4: New For Each(false,false,false,false)[bag] - scope-155
    |   |
    |   Project[chararray][0] - scope-143
    ...
    |
    |---bus_2: New For Each(false,false,false)[bag] - scope-142
        |   |
        |   Project[chararray][2] - scope-133
        ...
        |
        |---bus_1: New For Each(true)[bag] - scope-132
            |   |
            |   Project[tuple][0] - scope-130
            ...
            |
            |---bus_0: New For Each(true)[bag] - scope-129
                |   |
                |   Cast[bag:{((chararray,chararray,chararray,{(float)},chararray,chararray,chararray,float,...
                |   |
                |   |---Project[bag][2] - scope-126
                |       |
                |       |---Project[bag][1] - scope-125
                |
                |---bus: Load(...) - scope-124--------
Global sort: false
----------------
A bit of good old shell for seasoned data scientists.
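#!/bin/bash
# Read from the file given as first argument, or from stdin when none is given.
input="${1:-/dev/stdin}"
#output="${2:-/dev/stdout}"
#error="${3:-/dev/stderr}"
if [ ! -z "$1" ] && [ ! -f "$1" ]
then
  # Diagnostics go to stderr so they never end up in the generated statements.
  echo "ERROR: input is not a file" >&2
  exit 1
fi
echo "Transforming ["$input"] / tramway(nom,ligne)|bus(nom,ligne)|metro(nom,ligne)|velo(nom,rue)" >&2
aIndice=0
while read aLine; do
  aIndice=$((aIndice + 1))
  # Rewrite a bike-station JSON line into a "(name:velo {...});" node pattern.
  aLine=`echo $aLine|sed -r 's/^(\{\"nom\"\:\"(.*)\"\,\"rue\".*)$/\(\2\:velo \1\)\;/g'|sed -r 's/\"nom\"/nom.../g'`
  # Part before the first ':' becomes the node identifier (spaces and dots replaced by '_').
  aString1=`echo $aLine|cut -f 1 -d ':'|sed -r 's/[ ]+/_/g'|sed -r 's/[.]+/_/g'...'s/^.([0-9]+.*)$/\(_\1/g'.../g'`
  # Everything after the first ':' is kept as the label and the properties.
  aString2=`echo $aLine|cut -f 2- -d ':'`
  echo "CREATE "$aString1:$aString2
done ...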
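A simplified variant of the conversion above may help to see the idea without the sed chain. This is only a sketch under assumptions: it reads tab-separated `nom<TAB>ligne` records (tab being PigStorage's default delimiter) rather than the JSON lines the original script handles, it creates anonymous nodes, and `cypher_escape`, `to_cypher_create` and the label argument are hypothetical names that are not part of pig2neo.sh.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not the original pig2neo.sh.
# Assumed input: one "nom<TAB>ligne" record per line on stdin.
# Output: one Cypher CREATE statement (anonymous node) per record.

# Escape backslashes and double quotes so a value stays a valid Cypher string literal.
cypher_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

to_cypher_create() {
  label="$1"                    # node label, e.g. bus, metro, tramway
  tab=$(printf '\t')
  while IFS="$tab" read -r nom ligne; do
    [ -z "$nom" ] && continue   # skip empty lines
    printf 'CREATE (:%s {nom:"%s", ligne:"%s"});\n' \
      "$label" "$(cypher_escape "$nom")" "$(cypher_escape "$ligne")"
  done
}
```

For instance, `printf 'Cabarette\t19\n' | to_cypher_create bus` prints `CREATE (:bus {nom:"Cabarette", ligne:"19"});`. Escaping the values first is the main design choice here: a stop name containing a double quote would otherwise break the generated statement.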
Loading into Neo4j.
hdfs dfs -cat /data31tech/output/velo_gen/part-m-00000 | /data31tech/dev/pig2neo.sh > /data31tech/dev/velo.insert.n4j
$NEO4J_HOME/bin/neo4j-shell -file /data31tech/dev/velo.insert.n4j
...
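Before handing a generated file to neo4j-shell, a quick sanity check can catch lines that the conversion mangled. The sketch below assumes one statement per line, each of the form `CREATE (...);`; `check_cypher_lines` is a hypothetical helper name, and the pattern is deliberately loose (it checks the shape of a line, not Cypher syntax).

```shell
#!/bin/sh
# Sketch only: check_cypher_lines is a hypothetical helper.
# Reads statements on stdin; assumes one "CREATE (...);" statement per line.
# Empty lines are tolerated. Offending lines are printed to stderr with their
# line numbers, and the function returns 1 if any are found.
check_cypher_lines() {
  bad=$(grep -n -v -E '^(CREATE \(.*\);)?$' || true)
  if [ -n "$bad" ]; then
    printf 'Malformed lines:\n%s\n' "$bad" >&2
    return 1
  fi
  return 0
}
```

A possible use is `check_cypher_lines < /data31tech/dev/velo.insert.n4j && $NEO4J_HOME/bin/neo4j-shell -file /data31tech/dev/velo.insert.n4j`, so that the load only starts when every line has the expected shape.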
Loading into Elasticsearch.
REGISTER /data31tech/elasticsearch-hadoop-2.1.1/dist/elasticsearch-hadoop-pig-2.1.1.jar;
STORE metro_4 INTO 'metro/json' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes=localhost:9200');
...
The excellent Pig Latin: the last language?
Sadly, it struggles with complex data structures. Addressing JSON gets an old-school, BeanUtil / Velocity level of service...
JSON, the new XML. The center of attention: Hortonworks (a bad first experience: Hue), PiggyBank, Elephant Bird and fr.data31tech.pig.opendata.tisseo.JsonLoader.
!! Despite all the $0.$0.$0, you never get inside the (etat,nom...) tuple ??
metro_0 = FOREACH metro GENERATE features.$2.$0.$0.$0.$0.$0.$0;
describe metro_0
metro_0: {{(properties: (etat: chararray,nom: chararray,ligne: chararray,geo_point_2d: {(value: float)}))}}
!! The cast works and the data structure is valid, yet the error message is a little surreal ??
grunt> metro_0 = FOREACH metro GENERATE (bag{tuple(tuple(chararray,chararray,chararray,bag{tuple(float)}))})features...
2015-07-22 11:42:52,970 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1128: Cannot find field nom in properties:tuple...
Two excellent links: Pig, Neo4j.
The exploration continues toward the best public transport "coverage" point.