Data Science Primer: for students doing their first data science project, with helpful Linux commands and advice.


Geolocator (string to GeoNames ID)


Mejova, Yelena, and Nicolas Kourtellis. "YouTubing at Home: Media Sharing Behavior Change as Proxy for Mobility Around COVID-19 Lockdowns." Web Science Conference (WebSci), 2021

A geolocator based on the GeoNames location dictionary. It covers all locations in all countries that have population > 0, along with all alternative names of those locations (including names in native languages and alphabets). It is self-contained, so nothing needs to be downloaded from GeoNames. Example code is included, and for US cities it shows how to get the ZIP code from GPS coordinates via the uszipcode package. Best used on the Twitter user location field :)
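As a rough illustration of dictionary-based geolocation, the sketch below matches the longest known place name inside a free-text location string. It is not the downloadable code: the function name locate, the TOY_GAZETTEER dictionary, and the numeric IDs are all hypothetical placeholders (the IDs are not real GeoNames IDs).

```python
import re

# Toy dictionary: lowercase place name (including alternative names) -> ID.
# The IDs are made-up placeholders, NOT real GeoNames IDs.
TOY_GAZETTEER = {
    "new york": 1,
    "nyc": 1,
    "london": 2,
    "londres": 2,
    "rome": 3,
    "roma": 3,
}


def locate(text, gazetteer=TOY_GAZETTEER, max_words=3):
    """Return the ID of the longest gazetteer name found in text, or None."""
    # \w+ is Unicode-aware in Python 3, so names in native alphabets tokenize too.
    tokens = re.findall(r"\w+", text.lower())
    # Try longer word sequences first so "new york" wins over a shorter match.
    for n in range(min(max_words, len(tokens)), 0, -1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                return gazetteer[candidate]
    return None
```

Calling locate("Living in New York, USA") with the toy dictionary returns the placeholder ID for New York, and a string with no known place name returns None.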

Download python code and dictionaries here

Name gender classifier


Mejova, Yelena, and Víctor Suarez-Lledó. "Impact of Online Health Awareness Campaign: Case of National Eating Disorders Association." International Conference on Social Informatics. Springer, Cham, 2020.

Do you have a dataset of social media profiles where people may use names? This code uses US Social Security and National Records of Scotland baby name listings, as well as names extracted from Google+, to locate a name in a short string and identify its gender. (Applicable to American/British names, less so to other nationalities.)
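The general idea of finding a name in a short string and assigning a gender can be sketched as follows. This is only a minimal illustration, not the downloadable code: the counts in TOY_NAME_COUNTS are invented, and the function name guess_gender and the min_share threshold are assumptions made for the example.

```python
import re

# Toy listing: lowercase first name -> (female_count, male_count).
# The counts are invented for illustration, not taken from any real listing.
TOY_NAME_COUNTS = {
    "maria": (950, 50),
    "john": (10, 990),
    "alex": (300, 700),
}


def guess_gender(profile_name, counts=TOY_NAME_COUNTS, min_share=0.9):
    """Return 'female', 'male', or 'unknown' for a short profile string."""
    # Split on anything that is not a letter (digits, underscores, emoji, spaces).
    for token in re.findall(r"[^\W\d_]+", profile_name.lower()):
        if token in counts:
            female, male = counts[token]
            total = female + male
            if female / total >= min_share:
                return "female"
            if male / total >= min_share:
                return "male"
            # The name is known but too ambiguous to call.
            return "unknown"
    return "unknown"
```

The threshold lets ambiguous names fall through to "unknown" rather than forcing a label.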

Download python code and dictionaries here


Controversy Lexicon

with: Carlos Castillo, Amy Zhang, and Nicholas Diakopoulos


Yelena Mejova, Amy X. Zhang, Nicholas Diakopoulos, Carlos Castillo. "Controversy and Sentiment in Online News". Computation+Journalism Symposium (CJ), 2014.

Using a multi-stage crowdsourced effort, we have created a lexicon of terms associated with controversial topics (primarily in the US press). We distinguish between controversial and weakly controversial terms, and also provide some non-controversial terms.

File: controversial_words.txt

Enriched American Food Lexicon

with: Sofiane Abbar and Ingmar Weber


Sofiane Abbar, Yelena Mejova, Ingmar Weber. "You Tweet What You Eat: Studying Food Consumption Through Twitter". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2015.

The foods were extracted from a large sample of food-related tweets by 210K users in late 2013. We began by collecting 50M tweets through the Twitter Streaming API using a hand-picked keyword filter from 2013/10/29 to 2013/11/29. We then selected all geo-tagged tweets and randomly selected 210K users from the US, for whom we collected up to 3.2K historical tweets. We used this dataset to bootstrap a new lexicon by crowdsourcing labels for these new tweets and using them to build a Naive Bayes classifier. Finally, we selected the 500 most popular terms in the tweets the classifier deemed food-related, then manually cleaned and annotated them with the information below. Ambiguous terms, including seeds, beverage, and brewed, as well as food characteristics like powdered, salted, and mashed, were removed.
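The classify-then-count step described above can be sketched with a small multinomial Naive Bayes classifier written from scratch. This is a simplified illustration under stated assumptions, not the pipeline used in the paper: the tokenizer, the add-one smoothing, the label name "food", and the function names train_nb, predict, and top_terms are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Very simple tokenizer (an assumption): lowercase letter runs, keeping # and @.
    return re.findall(r"[a-z#@']+", text.lower())


def train_nb(labeled_tweets):
    """Train on an iterable of (text, label) pairs and return a model dict."""
    doc_counts = Counter()   # label -> number of training tweets
    word_counts = {}         # label -> Counter of token frequencies
    vocab = set()
    for text, label in labeled_tweets:
        doc_counts[label] += 1
        counter = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counter[tok] += 1
            vocab.add(tok)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # tokens never seen in training are skipped
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def top_terms(model, tweets, label="food", k=500):
    """Return the k most frequent tokens in tweets classified as `label`."""
    counts = Counter()
    for tweet in tweets:
        if predict(model, tweet) == label:
            counts.update(tokenize(tweet))
    return [term for term, _ in counts.most_common(k)]
```

In this sketch, top_terms plays the role of selecting the most popular terms from tweets the classifier deems food-related; the manual cleaning and annotation that follow are not automated here.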

File: twitter_food_calorie_lexicon.txt

Foursquare Category Hierarchy

Are you working with Foursquare data, but having difficulty with the formatting of their category documentation? The files below contain the hierarchy as described on March 10, 2015, in a form that is easy to process and browse.

Files: Foursquare_Category_Hierarchy.txt, Foursquare_Category_Hierarchy.xlsx

Instagram restaurant tag labels


Yelena Mejova, Hamed Haddadi, Anastasios Noulas, Ingmar Weber. "#FoodPorn: Obesity Patterns in Culinary Interactions". The 5th International Conference on Digital Health, 2015.

These are the top 2000 tags (minus tags in non-Latin alphabets) collected from Instagram images taken at restaurants across the United States during September, October, and November 2014. The tags were labeled using Crowdflower, taking the majority label out of 3 annotations. Agreement on these tasks was very high, at 92-99% label overlap.
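Taking a majority label out of 3 annotations can be sketched in a few lines. The function name majority_label and the example label values are hypothetical; the actual label set is defined in the file below.

```python
from collections import Counter


def majority_label(annotations):
    """Return the label chosen by more than half of the annotators, else None."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None
```

With three annotators, a tag gets a label whenever at least two of them agree; a three-way split yields None and would need separate handling.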

File: instagram_restaurant_tag_labels.txt

Names of Qatari Tribes and Families in Arabic and English

The tiny country of Qatar has a fascinating and rich history, some of which can be found in the names of the most prominent families in the country. In the list you can find the family names in Arabic and several versions in English.


Halal on Instagram: sub-topical lexicons


Yelena Mejova, Youcef Benkhedda, Khairani. #Halal Culture on Instagram. Frontiers in Digital Humanities: Big Data, 2017.

These lexicons are for Arabic, English, and Indonesian languages, and were extracted from a large collection of Instagram posts mentioning #halal (in English/Indonesian and Arabic). The topics span food, religion, animal trade, health, and supplements.


Loneliness related emotions



77 emotions extracted from a dataset of tweets mentioning loneliness in 2019-2020.

File: loneliness_emotions


Language of Politics on Twitter

AI Summer School, American University in Beirut, June 16, 2015


Tutorial Political Text Mining (Geolocation, Text Analysis, Dynamic Spatial Visualization)

Social Media for Health Research

At the International Conference on Web and Social Media, 2017