Using signals of user communication from social media to predict likeliness of code-borrowing and code-mixing
By : Jobin Wilson, Ram Mohan, Muhammad Arif
Flytxt Data Science R&D Team
The lush green Indian Institute of Technology Madras (IIT Madras) campus made a picturesque backdrop for the celebratory photos we took at the Conference on Data Sciences (CoDS) Data Challenge. This was the second year in a row that a team from Flytxt came out on top in a challenging data science contest: last year it was the KDD Cup; this year it was IKDD CoDS, a contest conducted by the India chapter of the Association for Computing Machinery's (ACM) Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) alongside their annual conference. Securing the top position on model score was indeed a proud moment, but even more than the joy of winning, the amount of learning one gains from such contests is immense.
This year's challenge came from multilingual natural language processing (NLP), a prominent field of research within machine learning and artificial intelligence. With the sheer volume of regional content (especially text) now available on social media, entertainment websites and the internet at large, the information retrieval community has increasingly recognised the importance of multilingual NLP research.
Code-mixing and code-borrowing are two important linguistic phenomena in multilingual NLP. Code-mixing is said to occur when words and phrases of one language (say English) are used while communicating in another language (say Hindi). This phenomenon is often seen in communication among bilingual and multilingual speakers. For instance, English-Hindi bilingual speakers often use English words like 'money' and 'cool' in their Hindi conversations, though they are not Hindi words. A related phenomenon is code-borrowing, wherein a word or phrase from a foreign language becomes part of the native vocabulary of the host language. For instance, native Hindi speakers might use words such as 'botal' and 'kaptaan', which are borrowed from the English words 'bottle' and 'captain' respectively. Code-mixing can eventually lead to foreign words entering the vocabulary of a native-speaking community of a different language: a native Hindi speaker might say 'match dekhna' (to watch the match), wherein the word 'match' is an example of code-borrowing.
Distinguishing code-borrowing from code-mixing helps in many aspects of multilingual information retrieval and natural language processing. For example, if we could identify multilingual queries containing only borrowed foreign words or phrases, those queries could be answered by accessing only monolingual documents of the host language, ultimately reducing their computational cost.
The Challenge
The ACM IKDD CoDS 2017 Data Challenge was to develop a model that predicts the likeliness of a word being borrowed from English into Hindi, using various signals of user communication obtained from social media. What makes this problem challenging is that there is no clear signal of code-borrowing, and the borrowing phenomenon evolves over time. Due to this dynamic nature, tracking recent conversations among people helps in estimating the likeliness of words being borrowed, and social media is the most valuable data source for this purpose.
A social dataset of approximately 0.24 million tweets from Twitter was used to estimate the likeliness of words being borrowed. Each word in a tweet was tagged as Hindi, English or Other. Based on heuristic rules defined over the word tags it contains, every tweet was then tagged as an English (EN) tweet, a Hindi (HI) tweet or a code-mixed Hindi (CMH) tweet. The key idea is that a word is more likely to be borrowed from English into Hindi if it is used mostly in Hindi tweets. The rules used for tagging tweets are shown in the table below.
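The actual tagging rules appear in the table referenced above; as an illustration only, a heuristic tagger of this kind can be sketched as follows (the thresholds here are assumptions for the sketch, not the contest's exact rules):

```python
def tag_tweet(word_tags):
    """Assign EN / HI / CMH to a tweet from its per-word language tags.

    word_tags: list of per-word tags, each 'hi', 'en' or 'other'.
    The rules below are illustrative, not the contest's exact table.
    """
    hi = word_tags.count('hi')
    en = word_tags.count('en')
    if hi > 0 and en > 0:
        return 'CMH'   # contains both Hindi and English words: code-mixed
    if hi > 0:
        return 'HI'    # Hindi words only
    if en > 0:
        return 'EN'    # English words only
    return 'OTHER'
```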
The distribution of tweets tagged to each category is shown below.
Words related to trending topics are more likely to occur in tweets, and the trending topics for English and Hindi tweets might differ. Statistics derived from such a dataset are therefore prone to noise, which would affect the performance of the model being developed. To counter this bias towards trending topics, a set of relevant tweets had to be identified. Since tweets carry hashtags, a hashtag-based filter was applied to eliminate irrelevant tweets. The filter first identifies relevant hashtags as those associated with HI or CMH tweets; during filtering, tweets carrying these relevant hashtags are selected to form the set of relevant tweets. This set represents a less biased dataset, from which word statistics are computed for our model.
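The two-pass filter just described can be sketched as follows (the tweet record layout is an assumption for the sketch):

```python
def relevant_tweets(tweets):
    """Keep only tweets whose hashtags also appear in HI or CMH tweets.

    tweets: list of dicts like {'tag': 'HI'|'CMH'|'EN', 'hashtags': set}.
    A sketch of the described hashtag filter, not the contest code.
    """
    # Pass 1: collect hashtags seen in Hindi or code-mixed tweets.
    relevant_tags = set()
    for t in tweets:
        if t['tag'] in ('HI', 'CMH'):
            relevant_tags |= set(t['hashtags'])
    # Pass 2: keep every tweet carrying at least one relevant hashtag.
    return [t for t in tweets if set(t['hashtags']) & relevant_tags]
```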
Each of the relevant tweets was lowercased and tokenised into words, and the tokens were stemmed to their root forms so that accurate statistics could be extracted.
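This preprocessing step can be sketched as below; the toy suffix-stripping stemmer is a stand-in assumption (an actual pipeline would use a proper stemmer):

```python
import re

def preprocess(tweet):
    """Lowercase, tokenise, and crudely stem a tweet's words.

    The suffix stripping here is a toy illustration of stemming,
    not the stemmer used in the contest pipeline.
    """
    tokens = re.findall(r"[a-z]+", tweet.lower())
    stemmed = []
    for tok in tokens:
        for suffix in ('ing', 'ed', 's'):
            # Strip one common suffix, keeping a reasonable root length.
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed
```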
The statistics extracted for a word reflect the number of tweets containing the word and the number of users who have used the word in their tweets. These counts are aggregated separately for EN, HI and CMH tweets. The following statistics are extracted for each word whose likeliness of being code-borrowed is to be computed.
RTU_hi(w) : the number of users who have used the word w in their Hindi tweets.
RTU_cmh(w) : the number of users who have used the word w in their CMH tweets.
RTU_en(w) : the number of users who have used the word w in their English tweets.
RTT_hi(w) : the number of Hindi tweets containing the word w.
RTT_cmh(w) : the number of CMH tweets containing the word w.
RTT_en(w) : the number of English tweets containing the word w.
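All six counts can be accumulated in one pass over the relevant tweets; a sketch, assuming each tweet record carries its tag, user id and preprocessed words:

```python
from collections import defaultdict

def word_statistics(tweets):
    """Compute RTT (tweet counts) and RTU (user counts) per word and tag.

    tweets: iterable of dicts {'tag': 'HI'|'CMH'|'EN', 'user': str,
                               'words': list of stemmed words}.
    Returns (rtt, rtu): rtt[(word, tag)] = number of tweets,
    rtu[(word, tag)] = number of distinct users. The record layout
    is an assumption for this sketch.
    """
    rtt = defaultdict(int)
    users = defaultdict(set)
    for t in tweets:
        for w in set(t['words']):          # count each word once per tweet
            rtt[(w, t['tag'])] += 1
            users[(w, t['tag'])].add(t['user'])
    rtu = {k: len(v) for k, v in users.items()}
    return rtt, rtu
```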
The collected statistics were used to train various machine learning models, including ordinal regression, linear regression, nonlinear regression and neural networks. However, due to the limited availability of training data, these models did not produce the best results. Hence, a hand-crafted mathematical function over the gathered statistics was formulated, which produced the best result. The hand-crafted function is:
Here, RHTUR(w) reflects a user-level preference for using a word in a particular language, whereas RHTTR(w) indicates a global usage preference. Since the number of conversations per user can vary significantly in social media datasets, we factored in both preferences to estimate the likeliness of an English word being code-borrowed into Hindi.
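The exact formula is given in the paper; its general shape, a combination of the word's Hindi-side share at the user level (RHTUR) and at the tweet level (RHTTR), can be sketched as follows. The specific ratio definitions and the simple averaging below are illustrative assumptions, not the published function:

```python
def borrowing_likeliness(rtu_hi, rtu_cmh, rtu_en, rtt_hi, rtt_cmh, rtt_en):
    """Illustrative score for how likely an English word is borrowed into Hindi.

    RHTUR-like term: share of the word's users who used it in HI/CMH tweets.
    RHTTR-like term: share of the word's tweets that are HI/CMH tweets.
    Both ratios and their combination are assumptions for this sketch;
    see the paper for the actual hand-crafted function.
    """
    total_users = rtu_hi + rtu_cmh + rtu_en
    total_tweets = rtt_hi + rtt_cmh + rtt_en
    if total_users == 0 or total_tweets == 0:
        return 0.0
    rhtur = (rtu_hi + rtu_cmh) / total_users    # user-level preference
    rhttr = (rtt_hi + rtt_cmh) / total_tweets   # global usage preference
    return (rhtur + rhttr) / 2
```

A word used only in English tweets scores 0, one used only in Hindi or code-mixed tweets scores 1, with mixed usage falling in between.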
A detailed technical treatment is available in our paper and poster:
Paper: Code-borrowedness of English words in Hindi language
Poster: Code-borrowedness of English words in Hindi language