Student Teaches Technology to Decipher the “New Languages” of Code-Switching
Swiftly, subtly, without most of us even realizing, the world has adopted a whole new set of languages, explains Arunavha Chanda, a junior studying computer engineering.
Social media and texting—rapid-fire messaging replete with contractions, acronyms, in-jokes, emojis, typos, and made-up words—are creating evolving vernaculars that are vexing enough for computers to decode, but multilingual users present a vaster problem. How can machines understand "code-switching"—humans' use of multiple languages in single utterances—especially involving sparsely documented languages like Bengali that are casually transliterated into Roman script?
With more than half of Twitter's traffic now involving code-switching or languages other than English, deciphering text is becoming increasingly important for sentiment analysis, speech recognition for services like Siri, and even detecting potential terrorists in our increasingly connected world.
Chanda took on the challenge of helping technology decipher code-switching by designing artificial intelligence capable not only of recognizing sequences of characters as words in different languages but of figuring out which language and meaning are intended. He began by compiling lexicons of English, Spanish, and Bengali (the last especially challenging, as even sequences of just two characters can be transliterated at least seven different ways) and designed rule-based algorithms for assessing which language is most likely. Then, he gathered an extensive corpus of Bengali-English Facebook chats to serve as a training set for machine learning algorithms he developed for considering context to help determine what users mean. He is also making his work publicly available for future research.
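To give a rough sense of what lexicon lookup combined with context can look like, here is a minimal Python sketch. It is not Chanda's system, which the article does not describe in detail: the tiny word lists, the romanized Bengali spellings, and the function names are all assumptions made for illustration, and a simple neighbor rule stands in for the trained, context-aware models mentioned above.

```python
import re

# Hypothetical toy lexicons (assumptions for illustration only).
# Real lexicons would hold many thousands of entries, including the
# many competing romanized spellings of each Bengali word.
LEXICONS = {
    "en": {"i", "am", "good", "how", "are", "you", "the", "movie",
           "tomorrow", "to", "go"},
    "bn": {"ami", "tumi", "bhalo", "valo", "achi", "achhi", "kemon",
           "kal", "jabo", "to"},
}


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def candidates(token):
    """Return the set of languages whose lexicon contains the token."""
    return {lang for lang, words in LEXICONS.items() if token in words}


def tag_languages(text):
    """Label each token with a language code.

    Pass 1 (lexicon rule): a token found in exactly one lexicon takes
    that language; ambiguous or unseen tokens are left unresolved.
    Pass 2 (context rule): an unresolved token copies the label of the
    nearest token resolved in pass 1, looking left first, then right.
    """
    tokens = tokenize(text)

    first_pass = []
    for tok in tokens:
        langs = candidates(tok)
        first_pass.append(next(iter(langs)) if len(langs) == 1 else None)

    resolved = list(first_pass)
    for i, tag in enumerate(first_pass):
        if tag is None:
            left = next((t for t in reversed(first_pass[:i]) if t), None)
            right = next((t for t in first_pass[i + 1:] if t), None)
            resolved[i] = left or right or "unknown"

    return list(zip(tokens, resolved))


if __name__ == "__main__":
    # "to" appears in both toy lexicons, so only context can settle it.
    print(tag_languages("ami to movie jabo"))
```

In this sketch the word "to" appears in both toy lexicons, so the lexicon rule alone cannot decide; the context rule assigns it the language of the neighboring word "ami". A learned model trained on real chat data would replace that hand-written rule with patterns drawn from many examples.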
"It's an extremely fertile area for getting things wrong," said Chanda, a recipient of the C. Prescott Davis and Prentice C. Hiam scholarships. "Even normal social media is very complex. But code-switching is how language works now. It's a fact of life." Bengali, for example, is the seventh most widely spoken language in the world, used by millions of people who may also speak English and/or Hindi.
Chanda was the youngest presenter at the Empirical Methods in Natural Language Processing (EMNLP) conference last November in Austin, Texas. The conference is devoted to natural language processing, the branch of computer science that deals with recognizing and processing human language and speech. He was invited to present his research after the Association for Computational Linguistics published two of his papers.
"I had an incredible time and it was an extremely proud moment for me," said the aspiring entrepreneur, who serves as a teaching assistant for data structures in the computer science department. "I got to meet stalwarts and legends whose work I had read and cited, and everyone was surprised to learn that I was an undergraduate."
Chanda, who became intrigued with natural language processing while at home in India, is also interested in algorithm development, logic design, and computer architecture. His research has been recognized by his peers as well: in January, he was a plenary speaker at the National Collegiate Research Conference, an international, multidisciplinary conference for undergraduate students.