I am reading multiple complex json files and trying to extract geo coordinates. I cannot attach the file itself right now, but I can print the tree here. The file has several hundred options and some objects repeat.
Please see the structure of the file in .txt
format.
When I read the json with Spark in Python, it shows me these coordinates in coordinates
column and it is there.
It is stored in coordinates
column. Please see a proof.
I am obviously trying to reduce the number of columns and select some columns.
The last two columns are my geo coordinates. I tried both coordinates
and geo
and also coordinates.coordinates
with geo.coordinates
. Both options do not work.
df_tweets = tweets.select(['text',
'user.name',
'user.screen_name',
'user.id',
'user.location',
'place.country',
'place.full_name',
'place.name',
'user.followers_count',
'retweet_count',
'retweeted',
'user.friends_count',
'entities.hashtags.text',
'created_at',
'timestamp_ms',
'lang',
'coordinates.coordinates', # or just `coordinates`
'geo.coordinates' # or just `geo`
])
In the first case with coordinates
and geo
I get the following, printing the schema:
df_tweets.printSchema()
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
|-- geo: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
When I do coordinates.coordinates
and geo.coordinates
, I get
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
When I print both dataframes in Pandas, none of them gives me coordinates, I still have None
.
How to extract geo coordinates properly?
Read more here: https://stackoverflow.com/questions/66324092/extracting-geo-coordinates-from-a-complex-nested-twitter-json-using-python
Content Attribution
This content was originally published by Anakin Skywalker at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.