Jens A. de Bruijn, Hans de Moel, Brenden Jongman, Jurjen Wagemaker, Jeroen C.J.H. Aerts
Abstract: Timely and accurate information about ongoing events is crucial for relief organizations seeking to respond effectively to disasters. Recently, social media platforms, especially Twitter, have gained traction as a novel source of information on disaster events. Unfortunately, geographical information is rarely attached to tweets, which hinders the use of Twitter for geographical applications. As a solution, geoparsing algorithms can extract and locate the geographical locations referenced in a tweet’s text. This paper describes TAGGS, a new algorithm that enhances location disambiguation by employing both metadata and the contextual spatial information of groups of tweets referencing the same location regarding a specific disaster type. Validation demonstrated that TAGGS attains a recall of approximately 0.82 and a precision of approximately 0.91. Without lowering precision, this roughly doubles the number of correctly found administrative subdivisions and cities, towns and villages compared to individual geoparsing. We applied TAGGS to 55.1 million flood-related tweets in 12 languages, collected over 3 years. We found 19.2 million tweets mentioning one or more flood locations, which can be towns (11.2 million), administrative subdivisions (5.1 million), or countries (4.6 million). In the future, TAGGS could form the basis for a global event detection system.
Keywords: geoparsing, geocoding, geotagging, geolocation, floods, Twitter, disaster response
Find the code on GitHub
Download all events here.
Shapefiles with location_IDs
Jens A. de Bruijn1,2, Hans de Moel1, Brenden Jongman1,3, Marleen C. de Ruiter1, Jurjen Wagemaker2, Jeroen C.J.H. Aerts1
1 Institute for Environmental Studies, VU University, De Boelelaan 1087, 1081HV Amsterdam, The Netherlands
2 FloodTags, Binckhorstlaan 36 M2.11, The Hague, 2516 BE, The Netherlands
3 Global Facility for Disaster Reduction and Recovery, World Bank Group, Washington D.C., 20433, USA
Over the last 10 years, floods have caused 400 billion euros in damage and almost 60,000 casualties. Research shows that rapid response efforts are often hampered by a lack of timely and useful information. Usually, floods are detected and monitored using hydrological models or satellite imagery. However, many flood events remain unreported, and the average delay between the start of a flood and its detection by response organizations is long. More recently, people and organizations have increasingly started using information from online media (e.g., Twitter, Facebook, WhatsApp, news articles and blog posts) to monitor flood events.
As part of ongoing research into the use of online media in flood monitoring, researchers at the Institute for Environmental Studies (IVM - VU University Amsterdam) and FloodTags released a new tool that detects and monitors flood events globally. It provides a real-time overview of ongoing flood events based on filtered Twitter data. Specifically, the global flood monitor (GFM) detects, in real time, regions with enhanced flood-related Twitter activity and classifies these as flood events. It then generates a world map visualizing these events (Figure 1) and their relevant tweets. The platform also provides access to historical events dating back to July 2014.
FloodTags collects, among other data, real-time Twitter data using Twitter’s streaming API. The GFM uses these data in 12 languages, selecting tweets that contain the keywords specified in Table 1.
| Language | Keywords |
|---|---|
| English | flood, floods, flooding, flooded, inundation, inundations, inundated |
| Indonesian | banjir, banjirjkt, bantubanjir |
| Filipino | baha, bumabaha, pagbaha |
| German | flut, hochwasser, Überflutung |
| Italian | inondazione, inondacioni, alluvione |
| Serbian | poplava, poplave, поплава, поплаве |
| Portuguese | inundação, inundacão, inundaçao, inundacao, inundações |
| Spanish | inundación, inundacion, inundar, inundaciones |
| Turkish | su taşkın, su baskını, sel bastı, sel suyu, sel yüzünden, taşkın oldu, sel suyunun |
On average, this amounts to roughly 75,000 flood-related tweets a day. Naturally, the number of tweets varies widely depending on the characteristics of the flood events that are ongoing. For example, when Hurricane Harvey made landfall in the USA, upwards of 600,000 tweets were posted within 24 hours. First, these tweets are filtered using a blacklist, discarding all tweets that mention words such as “protests”, “smuggled” and “timeline”.
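The blacklist step can be illustrated with a minimal Python sketch. The three blacklisted words are the examples given in the text; the full list used by the GFM is not stated here, and matching on whole lowercase tokens is an assumption made for this illustration.

```python
import re

# Only these three example words are named in the text; the real list is longer.
BLACKLIST = {"protests", "smuggled", "timeline"}

def is_blacklisted(text, blacklist=BLACKLIST):
    """Return True if any blacklisted word occurs as a whole token in the text."""
    # Assumption: case-insensitive matching on whole word tokens.
    tokens = re.findall(r"\w+", text.lower())
    return any(token in blacklist for token in tokens)

def filter_tweets(texts, blacklist=BLACKLIST):
    """Keep only the tweet texts that contain no blacklisted word."""
    return [text for text in texts if not is_blacklisted(text, blacklist)]

sample = [
    "Severe flooding in Jakarta tonight",
    "A flood of protests hit the capital",
    "My timeline is flooded with memes",
]
kept = filter_tweets(sample)
```

In this toy sample only the first tweet survives, because the other two use "flood" figuratively alongside a blacklisted word.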
To detect enhanced Twitter activity in regions, locations need to be attached to tweets. Unfortunately, only ~2% of tweets include the GPS location of the user at the time of posting. An additional problem with these GPS locations is that a major flood event, such as the hurricanes that hit several countries around the Caribbean Sea and the Gulf of Mexico, might receive news coverage from all around the world. This can result in enhanced flood-related Twitter activity in many locations around the world, including places far from the flood itself.
Therefore, we created the TAGGS algorithm [1,2] (Toponym-based Algorithm for Grouped Geoparsing of Social media) to find mentions of locations (i.e., countries, administrative subdivisions, cities, towns and villages) in tweets. The algorithm works in roughly two steps: 1) toponym recognition and 2) toponym disambiguation. In the first step, the sentence is split into individual words (unigrams) as well as sequences of two and three consecutive words (bigrams and trigrams). These n-grams are then matched against a near-comprehensive set of geographical locations (a gazetteer) created from the GeoNames database [3] (Figure 2).
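The recognition step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the TAGGS implementation: the tokenizer is a simple regular expression, and `GAZETTEER` is a hypothetical two-entry stand-in for the GeoNames-derived gazetteer.

```python
import re

def ngrams(text, max_n=3):
    """Yield all unigrams, bigrams and trigrams of the text, lowercased."""
    tokens = re.findall(r"\w+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Hypothetical toy gazetteer: toponym -> candidate locations.
GAZETTEER = {
    "boston": ["Boston, UK", "Boston, Massachusetts, USA"],
    "new york": ["New York, USA"],
}

def recognize_toponyms(text, gazetteer=GAZETTEER):
    """Return every n-gram that matches a gazetteer entry, with its candidates."""
    return {gram: gazetteer[gram] for gram in ngrams(text) if gram in gazetteer}

found = recognize_toponyms("Flooding reported in New York and Boston")
```

Here `found` contains both "new york" (matched as a bigram) and "boston" (matched as a unigram). The latter still has two candidate locations, which is what the disambiguation step has to resolve.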
Unfortunately, many place names (toponyms) can refer to multiple locations (e.g., Boston, UK and Boston, Massachusetts, USA). To disambiguate the toponyms, the algorithm first groups all tweets mentioning the same toponym within a 24-hour timeframe. Then, for all tweets within these groups, additional spatial indicators, such as user time zone, user home town, GPS location and other location mentions in a tweet’s text, are analyzed. Based on these indicators, the most likely location is selected for all tweets within the group (Figure 3).
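The idea of grouped disambiguation can be sketched as a simple vote. The sketch below makes several assumptions that are not taken from the text: tweets are grouped by calendar day rather than a sliding 24-hour window, each tweet's spatial indicators have already been reduced to a list of country codes (`hints`), and every indicator counts equally. The actual scoring used by TAGGS may differ.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical candidates per toponym, keyed by country code.
CANDIDATES = {
    "boston": {"GB": "Boston, UK", "US": "Boston, Massachusetts, USA"},
}

def disambiguate(tweets, candidates=CANDIDATES):
    """Pick one location per (toponym, day) group by pooling all tweets' hints."""
    groups = defaultdict(list)
    for tweet in tweets:
        # Assumption: a calendar day approximates the 24-hour timeframe.
        groups[(tweet["toponym"], tweet["time"].date())].append(tweet)

    resolved = {}
    for (toponym, day), members in groups.items():
        votes = Counter()
        for tweet in members:
            for country in tweet["hints"]:
                if country in candidates[toponym]:
                    votes[country] += 1  # assumption: equal weight per indicator
        if votes:
            best_country = votes.most_common(1)[0][0]
            resolved[(toponym, day)] = candidates[toponym][best_country]
    return resolved

tweets = [
    {"toponym": "boston", "time": datetime(2017, 8, 27, 9, 0), "hints": ["US"]},
    {"toponym": "boston", "time": datetime(2017, 8, 27, 11, 30), "hints": ["US"]},
    {"toponym": "boston", "time": datetime(2017, 8, 27, 14, 0), "hints": []},
    {"toponym": "boston", "time": datetime(2017, 8, 27, 18, 0), "hints": ["GB"]},
]
resolved = disambiguate(tweets)
```

The third tweet carries no indicator of its own, yet it inherits the location chosen for its group. That is the benefit of grouping over geoparsing each tweet individually.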
The GFM conducts event detection at the level of countries and their first-order administrative subdivisions (e.g., provinces in the Netherlands and states in the USA). Based on the locations mentioned, tweets are assigned to these regions. Tweets mentioning a country are assigned to that country, and tweets mentioning a first-order administrative subdivision or a geographic entity within it are assigned to that first-order administrative subdivision.
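The assignment rule can be written down compactly. In this sketch the `Location` record, its `kind` values and the `"country.subdivision"` region identifier format are all hypothetical, chosen only to make the rule concrete.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Location:
    name: str
    kind: str                   # "country", "adm1" or "place" (hypothetical labels)
    country: str                # country the location belongs to
    adm1: Optional[str] = None  # first-order subdivision, if any

def assign_region(location):
    """Map a resolved location to the region used for event detection."""
    if location.kind == "country":
        return location.country
    if location.adm1 is None:
        raise ValueError(f"{location.name} has no first-order subdivision")
    # Subdivisions and any entity inside them share the subdivision's region.
    return f"{location.country}.{location.adm1}"

usa = Location("United States", "country", "US")
texas = Location("Texas", "adm1", "US", "Texas")
houston = Location("Houston", "place", "US", "Texas")
regions = [assign_region(loc) for loc in (usa, texas, houston)]
```

A tweet mentioning Houston therefore counts toward Texas, not toward the USA as a whole.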
Then, burst detection is performed by analyzing the time difference between several consecutive tweets assigned to a region. When the time difference between several consecutive tweets falls below a region-specific threshold, this burst is classified as a flood event. An example thereof is given in Figure 4 for the Rift Valley Province in Kenya.
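One way to read this rule is as a sliding window over the tweet timestamps of a region. The sketch below follows that reading. The window size of 5 tweets and the 600-second threshold are placeholder values, not the region-specific thresholds used by the GFM, and merging overlapping windows into one burst is a choice made for this illustration.

```python
def detect_bursts(timestamps, window=5, threshold=600):
    """Return (start, end) pairs where `window` consecutive tweets span < threshold.

    Timestamps are seconds; overlapping qualifying windows are merged.
    """
    times = sorted(timestamps)
    bursts = []
    for i in range(len(times) - window + 1):
        start, end = times[i], times[i + window - 1]
        if end - start < threshold:
            if bursts and start <= bursts[-1][1]:
                # This window overlaps the previous burst, so extend it.
                bursts[-1] = (bursts[-1][0], end)
            else:
                bursts.append((start, end))
    return bursts

# Two isolated tweets, then six tweets one minute apart, then one late tweet.
times = [0, 3600, 7200, 7260, 7320, 7380, 7440, 7500, 14400]
bursts = detect_bursts(times)
```

With these placeholder settings the six closely spaced tweets form a single burst from second 7200 to second 7500, while the isolated tweets before and after are ignored. A region with a high baseline of flood-related tweets would need a stricter threshold than a quiet one, which is why the threshold is region-specific.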
Early event detection and response can significantly decrease flood impacts. Currently, early warning systems rely on gauges, radar data, informal local sources and hydrological models. Recently, the use of social media in detecting floods has shown promising results. Due to the large data volumes and a lack of effective algorithms, efforts to detect floods using social media have been limited to short time spans and small geographical areas. This global flood monitor is based on a new algorithm that uses Twitter to locate and detect floods in real time on a global scale.