Query DVF Etalab database with AWS Athena

Simple steps to load the DVF Etalab dataset into AWS S3 and query it with Athena. This dataset comes from the Etalab project (https://github.com/etalab/DVF-app) and contains information about official real estate transactions in France.

I will not explain in detail what AWS S3 and Athena are. Ask Google if you want to know more, but basically:
- S3 is the storage service
- Athena is a SQL query service that can use content stored in S3 as its data source.

To query the DVF Etalab dataset, do the following:
- Have a Linux machine with connectivity to Amazon Web Services and an AWS account. You need at least 4 GB of free disk space on the machine.
- Load the DVF dataset:

# Download data set
$> for YEAR in 2014 2015 2016 2017 2018 2019; do [ ! -f full_$YEAR.csv.gz ] && wget https://cadastre.data.gouv.fr/data/etalab-dvf/latest/csv/$YEAR/full.csv.gz -O full_$YEAR.csv.gz; done
# Ungzip data
$> gunzip *.gz
# Here are the data
$> ls -la
-rw-rw-r-- 1 ec2-user ec2-user 437018380 Oct 31 02:55 full_2014.csv
-rw-rw-r-- 1 ec2-user ec2-user 479101443 Oct 31 02:09 full_2015.csv
-rw-rw-r-- 1 ec2-user ec2-user 511631149 Oct 31 01:20 full_2016.csv
-rw-rw-r-- 1 ec2-user ec2-user 589057549 Oct 31 00:31 full_2017.csv
-rw-rw-r-- 1 ec2-user ec2-user 532362246 Oct 30 23:39 full_2018.csv
-rw-rw-r-- 1 ec2-user ec2-user 176088220 Oct 30 22:48 full_2019.csv
# compute MD5
$> md5sum *.csv
a10faa9cbc1fa224da345b10fa7f8d72 full_2014.csv
ec2b4ed1692ee3cccffbc946004d972b full_2015.csv
15d4bbce240560bb0e0813354ba06fd0 full_2016.csv
1b6366cad598a9afce7e02b53e0b1437 full_2017.csv
cbbe2ba7b1f109bd34c6dc3d0075c9f6 full_2018.csv
54818ba0846c5f97cc119b33ce2b16c5 full_2019.csv

Now you have the full dataset. The files are in CSV format (comma-separated) and UTF-8 encoded. You can now push them to S3:
- create two buckets in your account: one for the data (call it “dvf-data-wqa8327ddq”) and one for the query results (call it “dvf-result-wqa8327ddq”). Bucket names must be unique across all AWS accounts and regions, so include your own unique suffix.
- in the data bucket, create a folder named “dvf”
- upload the CSV files to the data bucket manually, or use the AWS CLI:

$> for YEAR in 2014 2015 2016 2017 2018 2019; do aws s3 cp full_$YEAR.csv s3://dvf-data-wqa8327ddq/dvf/; done

Now you have an S3 bucket with all the DVF data in it. It’s time to use the AWS Athena service.

On your AWS console, open the Athena service. Be sure to be in the same region as your S3 buckets.

If you have never used the Athena service before, it will prompt you for an S3 bucket to store query results. Select the “dvf-result-wqa8327ddq” bucket you created earlier.

In the Athena console, open a new query tab and copy/paste the following query:

CREATE EXTERNAL TABLE `dvf`(
`id_mutation` string,
`date_mutation` date,
`numero_disposition` bigint,
`nature_mutation` string,
`valeur_fonciere` decimal(12,2),
`adresse_numero` int,
`adresse_suffixe` string,
`adresse_nom_voie` string,
`adresse_code_voie` string,
`code_postal` string,
`code_commune` string,
`nom_commune` string,
`code_departement` string,
`ancien_code_commune` string,
`ancien_nom_commune` string,
`id_parcelle` string,
`ancien_id_parcelle` string,
`numero_volume` string,
`lot1_numero` string,
`lot1_surface_carrez` decimal(9,2),
`lot2_numero` string,
`lot2_surface_carrez` decimal(9,2),
`lot3_numero` string,
`lot3_surface_carrez` decimal(9,2),
`lot4_numero` string,
`lot4_surface_carrez` decimal(9,2),
`lot5_numero` string,
`lot5_surface_carrez` decimal(9,2),
`nombre_lots` int,
`code_type_local` string,
`type_local` string,
`surface_reelle_bati` decimal(9,2),
`nombre_pieces_principales` int,
`code_nature_culture` string,
`nature_culture` string,
`code_nature_culture_speciale` string,
`nature_culture_speciale` string,
`surface_terrain` decimal(12,2),
`longitude` decimal(9,7),
`latitude` decimal(9,7))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://dvf-data-wqa8327ddq/dvf'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1573641601')

Change the LOCATION field to reflect your own S3 bucket name. This query creates a table called “dvf”, with virtual columns mapped to the CSV columns.
After a few seconds, you will be able to query the dataset.
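
Before running heavier queries, a quick sanity check (a minimal query of my own, no console output shown here) is to fetch a few rows and verify that the columns line up with the CSV fields:

-- peek at a few rows to confirm the CSV-to-column mapping looks right
select id_mutation, date_mutation, valeur_fonciere, code_postal, nom_commune
from dvf
limit 10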

Example queries:

Number of records in the dataset:

select
count(*)
from dvf

Results :
__| _col0
1 | 15657837

Min and max transaction values in Paris:

select
min(valeur_fonciere), max(valeur_fonciere)
from dvf
where code_departement='75'

Results :
__| _col0 | _col1
1 | 0.15 | 1249132030.00

Average price per m² in Paris districts:

SELECT code_postal,
min(valeur_fonciere / lot1_surface_carrez),
max(valeur_fonciere / lot1_surface_carrez),
avg(valeur_fonciere / lot1_surface_carrez)
FROM dvf
WHERE code_departement='75'
AND lot1_surface_carrez is NOT NULL
GROUP BY code_postal
order by code_postal

Results :
code_postal | min    | max        | avg
            | 757.58 | 29708.67   | 11388.74
75001       | 0.00   | 5714285.71 | 28444.14
75002       | 0.00   | 755952.38  | 13098.62
75003       | 0.00   | 451388.89  | 12638.46
75004       | 0.01   | 3418803.42 | 19740.37
(...)

(Screenshot of the Athena console: aws_athena.png)
Interesting, isn’t it?

Things to improve: use partitioned data to get faster (and cheaper) queries, as sketched below.
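
Here is a minimal sketch of what that could look like. I have not actually set this up: the dvf-partitioned prefix and the annee partition column are just illustrative, and only the first five CSV columns are declared to keep it short. The idea is to upload each year’s CSV under its own prefix, such as s3://dvf-data-wqa8327ddq/dvf-partitioned/annee=2019/full_2019.csv, declare the table as partitioned, and let Athena register the partitions:

-- hypothetical partitioned table: only the first five CSV columns are declared here
-- to keep the sketch short; in practice, repeat the full column list of the dvf table
CREATE EXTERNAL TABLE `dvf_partitioned`(
`id_mutation` string,
`date_mutation` date,
`numero_disposition` bigint,
`nature_mutation` string,
`valeur_fonciere` decimal(12,2))
PARTITIONED BY (`annee` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://dvf-data-wqa8327ddq/dvf-partitioned/'

-- discover the annee=YYYY prefixes and register them as partitions
MSCK REPAIR TABLE dvf_partitioned

-- a query that filters on the partition column only scans the matching prefix
select count(*) from dvf_partitioned where annee = '2019'

Since Athena bills per amount of data scanned, pruning partitions this way makes queries both faster and cheaper.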
