Extracting geo coordinates from a complex nested Twitter JSON using Python

I am reading multiple complex JSON files and trying to extract geo coordinates. I cannot attach the file itself right now, but I can print the tree here. The file has several hundred fields, and some objects repeat.

Please see the structure of the file in .txt format.

When I read the JSON with Spark in Python, the coordinates are present and stored in the coordinates column.

I am trying to reduce the number of columns by selecting only some of them.

The last two columns are my geo coordinates. I tried selecting coordinates and geo, and also coordinates.coordinates and geo.coordinates. Neither option works.

df_tweets = tweets.select([
    'text',
    'user.name',
    'user.screen_name',
    'user.id',
    'user.location',
    'place.country',
    'place.full_name',
    'place.name',
    'user.followers_count',
    'retweet_count',
    'retweeted',
    'user.friends_count',
    'entities.hashtags.text',
    'created_at',
    'timestamp_ms',
    'lang',
    'coordinates.coordinates',  # or just `coordinates`
    'geo.coordinates',  # or just `geo`
])
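One thing worth ruling out is the column naming: selecting nested fields by dotted path keeps only the last part of the path, so user.name and place.name both come out as name, and the two coordinate arrays both come out as coordinates. A minimal sketch of giving each selected field a unique alias is shown here. The helper names alias_for and select_with_unique_names are hypothetical, and the sketch assumes tweets is the Spark DataFrame read from the JSON files.

```python
def alias_for(path):
    """Turn a dotted field path into a flat, unique column name (hypothetical helper)."""
    return path.replace(".", "_")


def select_with_unique_names(df, paths):
    """Select nested fields from a Spark DataFrame, aliasing each one by its full path."""
    # Imported here so alias_for stays usable without a Spark installation.
    from pyspark.sql import functions as F

    return df.select([F.col(p).alias(alias_for(p)) for p in paths])


# A shortened path list for illustration; the full list from the question works the same way.
paths = ["text", "user.name", "place.name", "coordinates.coordinates", "geo.coordinates"]
unique_names = [alias_for(p) for p in paths]
```

With this, select_with_unique_names(tweets, paths) would yield columns such as user_name, place_name, coordinates_coordinates and geo_coordinates, which stay distinguishable after converting to Pandas.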

In the first case, selecting coordinates and geo, printing the schema gives the following:

df_tweets.printSchema()

root
 |-- text: string (nullable = true)
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- name: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- retweeted: boolean (nullable = true)
 |-- friends_count: long (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- created_at: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)

When I select coordinates.coordinates and geo.coordinates, I get:

root
 |-- text: string (nullable = true)
 |-- name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- location: string (nullable = true)
 |-- country: string (nullable = true)
 |-- full_name: string (nullable = true)
 |-- name: string (nullable = true)
 |-- followers_count: long (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- retweeted: boolean (nullable = true)
 |-- friends_count: long (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- created_at: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)

When I convert either dataframe to Pandas and print it, neither shows any coordinates; the values are still None.

How do I extract the geo coordinates properly?
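A possible explanation, offered as an assumption rather than a diagnosis: in the Twitter v1.1 tweet format, coordinates and geo are null for tweets without an exact location, so many rows may legitimately be None. The sketch below checks this on raw JSON lines with plain Python, outside Spark. The function name extract_lon_lat and the two sample records are made up for illustration; note that coordinates uses GeoJSON order (longitude, latitude) while the legacy geo field uses (latitude, longitude).

```python
import json


def extract_lon_lat(tweet):
    """Return (longitude, latitude) for a tweet dict, or None if it is not geotagged."""
    point = tweet.get("coordinates")
    if point and point.get("coordinates"):
        lon, lat = point["coordinates"]  # GeoJSON order: longitude first
        return lon, lat
    geo = tweet.get("geo")
    if geo and geo.get("coordinates"):
        lat, lon = geo["coordinates"]  # legacy field: latitude first
        return lon, lat
    return None


# Made-up sample records: one without a location, one geotagged.
sample_lines = [
    '{"text": "hello", "coordinates": null, "geo": null}',
    '{"text": "geotagged", '
    '"coordinates": {"type": "Point", "coordinates": [-73.99, 40.73]}, '
    '"geo": {"type": "Point", "coordinates": [40.73, -73.99]}}',
]

points = [extract_lon_lat(json.loads(line)) for line in sample_lines]
geotagged = [p for p in points if p is not None]
```

If only a small share of records ends up in geotagged, the None values seen in Pandas are probably just the non-geotagged tweets, and filtering for non-null coordinates before inspecting the data would show the actual values.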



Read more here: https://stackoverflow.com/questions/66324092/extracting-geo-coordinates-from-a-complex-nested-twitter-json-using-python

Content Attribution

This content was originally published by Anakin Skywalker at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
