ML News Monthly – Jan 2021

Welcome to the fourth edition of ML News Monthly – Jan 2021!!

Here are the key happenings this month in the Machine Learning field that I think are worth knowing about. 🕸


1) SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY

Google researchers developed and benchmarked techniques they claim enabled them to train a language model containing more than a trillion parameters. They say their 1.6-trillion-parameter model, which appears to be the largest language model to date, achieved up to a 4x speedup over Google's previously largest language model (T5-XXL).
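
For intuition, here is a minimal PyTorch sketch of the paper's core mechanism, top-1 ("switch") routing, where each token is processed by exactly one expert feed-forward network. This is an illustration only: it omits the capacity factor and load-balancing loss, and all names are mine, not the paper's code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwitchFFN(nn.Module):
        """Toy Switch layer: each token is routed to exactly one expert FFN."""
        def __init__(self, d_model, d_ff, n_experts):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # routing logits
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (tokens, d_model)
            probs = F.softmax(self.router(x), dim=-1)
            gate, idx = probs.max(dim=-1)  # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # scaling by the gate value keeps the router differentiable
                    out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
            return out

    layer = SwitchFFN(d_model=16, d_ff=32, n_experts=4)
    y = layer(torch.randn(10, 16))  # 10 tokens routed across 4 experts

Because each token activates only one expert, the parameter count grows with the number of experts while per-token compute stays roughly constant; that is the "simple and efficient sparsity" of the title.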

https://arxiv.org/pdf/2101.03961.pdf

2) On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size.

In this paper, the authors take a step back and ask: How big is too big? What are the possible risks associated with this technology, and what paths are available for mitigating those risks?

http://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf

3) A criticism of “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”

The criticism has two parts:

  1. The paper is attacking the wrong target.
  2. The paper takes one-sided political views without presenting them as such, and without presenting alternative views.

https://gist.github.com/yoavg/9fc9be2f98b47c189a513573d902fb27

4) Machine Learning: The Great Stagnation

https://marksaroufim.substack.com/p/machine-learning-the-great-stagnation

5) AIs that read sentences are now catching coronavirus mutations

NLP algorithms are now able to generate protein sequences and predict virus mutations, including key changes that help the coronavirus evade the immune system. The key insight making this possible is that many properties of biological systems can be interpreted in terms of words and sentences.

https://www-technologyreview-com.cdn.ampproject.org/c/s/www.technologyreview.com/2021/01/14/1016162/ai-language-nlp-coronavirus-hiv-flu-mutations-antinbodies-immune-vaccines/amp/

6) DALL·E: Creating Images from Text

DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs.

https://openai.com/blog/dall-e/

7) Machine learning is going real-time

There seems to be little consensus on what real-time ML means, and there hasn’t been much in-depth discussion of how it’s done in industry. In this post, the author shares what they learned after talking to about a dozen companies that are doing it.

https://huyenchip.com/2020/12/27/real-time-machine-learning.html

8) Let’s review productized GPT-3 together

The authors have created a comprehensive and, more importantly, collaborative map of what entrepreneurs have built with GPT-3.

https://medium.com/cherrytales/lets-review-productized-gpt-3-together-aeece64343d7

9) Finding the Words to Say: Hidden State Visualizations for Language Models

By visualizing the hidden state between a model’s layers, we can get some clues as to the model’s “thought process”.
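
As a rough, hand-rolled sketch of that idea (my code, not the post's), you can project each layer's hidden state through the output embeddings with Hugging Face transformers and see which token each layer would predict; the prompt and the reuse of the final layer norm are my own choices:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    model = GPT2LMHeadModel.from_pretrained("distilgpt2").eval()

    inputs = tokenizer("Heathrow airport is located in the city of",
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # Which token does each layer "favor" at the last position?
    for layer, h in enumerate(out.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
        print(f"layer {layer:2d}:", tokenizer.decode([logits.argmax().item()]))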

https://jalammar.github.io/hidden-states/

10) Insightful AI Books To Read in 2021

https://blog.crossminds.ai/post/10-ai-books-to-read-in-2021-machine-learning-researchers-engineers

11) How NLP can make travelling more accessible

IIT-Madras scientists have launched AI4Bharat to boost AI innovation in India. AI4Bharat is a community of engineers, domain experts, policymakers, and academics collaborating on AI solutions to problems specific to India. AI4Bharat's ongoing NLP projects include Signboard Translation from Vernacular Languages, Fonts for Indian Scripts, Word Embeddings for Indian Languages, and many more.

https://indiaai.gov.in/article/how-nlp-can-make-travelling-more-accessible

12) A new open data set for multilingual speech research

Facebook AI is releasing Multilingual LibriSpeech (MLS), a large-scale, open source data set designed to help advance research in automatic speech recognition (ASR). MLS is designed to help the speech research community’s work in languages beyond just English so people around the world can benefit from improvements in a wide range of AI-powered services.

https://ai.facebook.com/blog/a-new-open-data-set-for-multilingual-speech-research/

13) BERT for easier NLP/NLU

This article introduces BERT and covers how to use it to improve performance on NLP/NLU tasks; sentiment classification is also presented as a case study, with code.
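
For a sense of how little code a BERT-style sentiment classifier takes, here is a minimal sketch using the Hugging Face pipeline API; this is my example, not the article's code, and the default checkpoint it downloads is a DistilBERT fine-tuned on SST-2:

    from transformers import pipeline

    # downloads a default English sentiment model on first use
    classifier = pipeline("sentiment-analysis")
    print(classifier("This newsletter was packed with useful links!"))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99}]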

https://www.linkedin.com/pulse/bert-easier-nlpnlu-code-included-ibrahim-sobh-phd

14) A quick guide to managing machine learning experiments

This article covers how to organize your machine learning experiments, trials, jobs, and metadata with Amazon SageMaker, and gain some peace of mind.
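
As a taste of the kind of bookkeeping involved, here is a hedged sketch using the sagemaker-experiments SDK; the experiment, trial, and metric names are invented for illustration, and configured AWS credentials are assumed:

    # pip install sagemaker-experiments
    from smexperiments.experiment import Experiment
    from smexperiments.trial import Trial
    from smexperiments.tracker import Tracker

    experiment = Experiment.create(
        experiment_name="churn-prediction",  # illustrative name
        description="Hyper-parameter sweep for a churn model",
    )
    trial = Trial.create(trial_name="xgb-lr-0p1",
                         experiment_name=experiment.experiment_name)

    # a Tracker records parameters and metrics as a trial component
    with Tracker.create(display_name="training") as tracker:
        tracker.log_parameter("learning_rate", 0.1)
        tracker.log_metric("validation:auc", 0.91)
        trial.add_trial_component(tracker.trial_component)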

https://towardsdatascience.com/a-quick-guide-to-managing-machine-learning-experiments-af84da6b060b

15) Leveraging language technology for national good: Initiatives by the Indian govt.

https://indiaai.gov.in/article/leveraging-language-technology-for-national-good-initiatives-by-the-indian-govt

Papers


16) Studying Catastrophic Forgetting in Neural Ranking Models

In this paper, the authors study to what extent neural ranking models catastrophically forget old knowledge acquired from previously observed domains after acquiring new knowledge, leading to a performance decrease on those domains.

https://arxiv.org/pdf/2101.06984.pdf

17) Can a Fruit Fly Learn Word Embeddings?

The mushroom body of the fruit fly brain is one of the best-studied systems in neuroscience. At its core, it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse, high-dimensional representation of the inputs.

In this work, the authors study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common natural language processing (NLP) task.
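
The motif itself is easy to sketch: project an input (a target word plus its bag-of-context) through sparse random wiring, then keep only the k strongest activations, mimicking the inhibition. The toy NumPy version below uses fixed random weights, whereas the paper learns them with a biologically inspired rule; all sizes and names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, n_kenyon, k = 1000, 400, 32  # vocab size, Kenyon cells, active cells

    # sparse random wiring stands in for projection-neuron -> Kenyon-cell synapses
    W = (rng.random((n_kenyon, 2 * vocab)) < 0.05).astype(float)

    def fly_hash(word_id, context_counts):
        """Map a (target word, bag-of-context) pair to a sparse binary code."""
        x = np.zeros(2 * vocab)
        x[word_id] = 1.0              # one-hot target word
        x[vocab:] = context_counts    # bag-of-words context
        activations = W @ x
        code = np.zeros(n_kenyon)
        code[np.argsort(activations)[-k:]] = 1.0  # k-winners-take-all
        return code

    code = fly_hash(42, rng.poisson(0.01, vocab).astype(float))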

https://arxiv.org/abs/2101.06887

18) DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

Data augmentation techniques have been widely used to improve machine learning performance, as they facilitate generalization. In this work, the authors propose a novel augmentation method that generates high-quality synthetic data for low-resource tagging tasks using language models trained on linearized labeled sentences.
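
The linearization trick is simple enough to show directly: label tokens are spliced in before the words they tag (O tags are dropped), so an ordinary language model can learn words and labels jointly and later be sampled for synthetic sentences. A rough sketch of that preprocessing step (my code, not the authors'):

    def linearize(tokens, tags):
        """Insert label tokens before tagged words; drop O tags."""
        out = []
        for token, tag in zip(tokens, tags):
            if tag != "O":
                out.append(tag)
            out.append(token)
        return " ".join(out)

    print(linearize(["John", "lives", "in", "New", "York"],
                    ["B-PER", "O", "O", "B-LOC", "I-LOC"]))
    # -> "B-PER John lives in B-LOC New I-LOC York"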

https://www.aclweb.org/anthology/2020.emnlp-main.488/

19) Saying No is An Art: Contextualized Fallback Responses for Unanswerable Dialogue Queries

Despite end-to-end neural systems making significant progress over the last decade for both task-oriented and chit-chat dialogue, most dialogue systems still rely on hybrid approaches: a combination of rule-based, retrieval-based, and generative methods for producing a set of ranked responses. Such systems need a fallback mechanism to respond to out-of-domain or novel user queries that are not answerable within the scope of the dialogue system.

The authors make use of rules over dependency parses and a text-to-text transformer fine-tuned on synthetic data of question-response pairs, generating highly relevant, grammatical, and diverse questions. They perform automatic and manual evaluations to demonstrate the efficacy of the system.

https://arxiv.org/pdf/2012.01873.pdf

https://github.com/kaustubhdhole/natural-dont-know

Courses / Resources


20) 5x Speedup on CI/CD via GitHub Actions' strategy.matrix

This post shows how to use strategy.matrix and GitHub Packages to significantly reduce the time spent in GitHub workflows. For the author, these tricks cut testing time from 40 minutes to 8 minutes!
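
If you have not used strategy.matrix before, the general shape is a single job definition fanned out over parallel shards. The workflow below is illustrative, not the post's exact setup, and it assumes the pytest-split plugin for slicing the test suite:

    # .github/workflows/test.yml (illustrative)
    on: [push]
    jobs:
      test:
        runs-on: ubuntu-latest
        strategy:
          matrix:
            group: [1, 2, 3, 4, 5]   # five parallel shards
        steps:
          - uses: actions/checkout@v2
          - uses: actions/setup-python@v2
            with:
              python-version: "3.8"
          - run: pip install pytest pytest-split
          # each job runs only its shard of the test suite
          - run: pytest --splits 5 --group ${{ matrix.group }}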

https://hanxiao.io/2021/01/24/Speedup-CI-Workflow-in-Github-Actions-via-Strategy-Matrix/

21) [Podcast] Building ML teams and finding ML jobs

https://talkpython.fm/episodes/show/298/building-ml-teams-and-finding-ml-jobs

22) Gartner: SaaS Will Be Even Bigger Than We Thought in 2022+

Gartner has increased its estimates for global enterprise and IT spend for 2021 and 2022, with Enterprise Software and SaaS the biggest beneficiaries, projected to grow a stunning 10.2% in 2022.

23) ZenML

ZenML is an extensible, open-source MLOps framework for building production-ready machine learning pipelines, in a simple way.

https://github.com/maiot-io/zenml

24) GENIE

GENIE is a leaderboard for natural language generation tasks. To provide a more accurate assessment of progress, it uses human evaluation of the entries, gathered dynamically via crowdsourcing (Amazon Mechanical Turk).

https://genie.apps.allenai.org

25) The Big Bad NLP Database

https://datasets.quantumstat.com

26) CLIP

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for the task, similar to the zero-shot capabilities of GPT-2 and GPT-3.
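
Zero-shot classification with the released repository takes only a few lines. A minimal sketch using the repo's own API ("photo.jpg" and the label set are placeholders):

    # pip install git+https://github.com/openai/CLIP.git
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    labels = ["a dog", "a cat", "a diagram"]
    text = clip.tokenize([f"a photo of {l}" for l in labels]).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1)

    print(dict(zip(labels, probs[0].tolist())))  # zero-shot label scores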

https://github.com/openai/CLIP

27) TextBox

TextBox is a Python/PyTorch library for reproducing and developing text generation algorithms in a unified, comprehensive, and efficient framework for research purposes. The library includes 16 text generation algorithms, covering two major tasks:

  • Unconditional (input-free) Generation
  • Sequence-to-Sequence (Seq2Seq) Generation, including Machine Translation and Summarization

https://github.com/RUCAIBox/TextBox

28) StrategyQA

StrategyQA is a question-answering benchmark focusing on open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy; for example, answering “Did Aristotle use a laptop?” requires implicitly reasoning about when Aristotle lived and when laptops were invented. StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.

https://allenai.org/data/strategyqa

29) The Pile

The Pile is an 825 GiB diverse, open-source language modelling data set that consists of 22 smaller, high-quality datasets combined.
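
The Pile ships as zstd-compressed JSON-lines shards; assuming that layout (one {"text", "meta"} object per line, shard names like 00.jsonl.zst), a few lines of Python are enough to stream documents without decompressing to disk:

    # pip install zstandard
    import io, json
    import zstandard as zstd

    def stream_pile(path, limit=3):
        """Yield (subset_name, text_snippet) from a .jsonl.zst shard."""
        with open(path, "rb") as f:
            reader = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(f))
            for i, line in enumerate(reader):
                if i >= limit:
                    break
                doc = json.loads(line)
                yield doc["meta"]["pile_set_name"], doc["text"][:80]

    for subset, snippet in stream_pile("00.jsonl.zst"):  # shard name illustrative
        print(subset, "|", snippet)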

https://pile.eleuther.ai

30) Dashboarding with JupyterLab 3

https://blog.jupyter.org/dashboarding-with-jupyterlab-3-789fcb1a5857


That's it!!

Let me know if I missed anything or if there's anything you think should be included in a future post.