ML News Monthly – Dec 2020

Welcome to the third edition of ML News Monthly – Dec 2020!!

Here are the key happenings this month in the Machine Learning field that I think are worth knowing about. 🕸


1. AI+X: Don’t Switch Careers, Add AI (to it)

In this article, the author argues that rather than abandoning your current career track to become a data scientist or a machine learning engineer, you should consider developing AI skills to complement your existing subject-matter expertise.

Toyota and Hyundai are looking for people with AI and polymer electrolyte membrane (PEM) expertise to build fuel cells that back up data centers, banks, hospitals, and telecommunication companies. Similarly, Biotechnology giant Illumina seeks talent with AI and DNA sequencing expertise to develop novel algorithms for genome interpretation.

https://kiankatan.medium.com/ai-x-dont-switch-careers-add-ai-34eff21dd3e1

2. AWS and NVIDIA achieve the fastest training times for Mask R-CNN and T5-3B

AWS has released new SageMaker distributed training libraries, which provide the easiest and fastest way to train deep learning models by automatically and efficiently splitting a model across multiple NVIDIA GPUs. This technology was used to achieve record training times for Mask R-CNN and T5-3B.

https://aws.amazon.com/blogs/machine-learning/aws-and-nvidia-achieve-the-fastest-training-times-for-mask-r-cnn-and-t5-3b/
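
A minimal sketch of what launching such a job might look like with the SageMaker Python SDK. The specific `distribution` parameters and values here are illustrative assumptions, not the exact settings AWS used for the record runs:

```python
# Hypothetical sketch: launching a model-parallel training job with the
# SageMaker Python SDK. Parameter values are illustrative, not AWS's exact setup.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.p3.16xlarge",  # 8 NVIDIA V100 GPUs per instance
    instance_count=2,
    framework_version="1.6.0",
    py_version="py36",
    # Enable the SageMaker model parallelism library, which splits the
    # model across GPUs automatically.
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": 4, "microbatches": 8},
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit("s3://my-bucket/training-data")  # hypothetical S3 path
```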

3. Privacy Considerations in Large Language Models – Google Blog

One of the risks with language models (which are trained to predict the next word) is their potential to leak details from the data on which they were trained, which becomes a real problem if a model trained on private data is made publicly available. This article discusses these issues in large language models like GPT-2, the ethical considerations around training data extraction attacks, and possible mitigations.

http://ai.googleblog.com/2020/12/privacy-considerations-in-large.html
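
To make the risk concrete, here is a minimal sketch of the extraction-attack intuition the post describes: sample freely from a public checkpoint and flag generations with unusually low perplexity, a possible sign of memorization. This uses the public Hugging Face GPT-2 checkpoint and is my own illustration, not the attack code from the underlying paper:

```python
# Minimal illustration of the extraction-attack intuition: sample from a
# language model, then flag generations the model itself assigns very low
# perplexity to (a possible sign of memorized training data).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

bos = tokenizer(tokenizer.bos_token, return_tensors="pt").input_ids
samples = model.generate(bos, do_sample=True, top_k=40, max_length=64,
                         num_return_sequences=5,
                         pad_token_id=tokenizer.eos_token_id)
for s in samples:
    text = tokenizer.decode(s, skip_special_tokens=True)
    print(f"ppl={perplexity(text):8.1f}  {text[:60]!r}")
# Samples with much lower perplexity than their peers are candidates for
# having been seen (near-)verbatim during training.
```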

4. MLOps Tooling Landscape v2 (+84 new tools) – Dec ’20

This article summarizes 284 MLOps tools available on the market. Of the 284, 180 are startups, and 65 of those startups raised money in 2020. Most startups that raised money in 2020 are still in the data pipeline category, with an increasing number in all-in-one (end-to-end) platforms, hardware, and serving.

https://huyenchip.com/2020/12/30/mlops-v2.html

5. ‘Papers with Code’ 2020 Review

This article summarizes the top trending papers, libraries and benchmarks for 2020 on Papers with Code.

https://medium.com/paperswithcode/papers-with-code-2020-review-938146ab9658

6. How fast is C++ compared to Python?

https://towardsdatascience.com/how-fast-is-c-compared-to-python-978f18f474c7
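
For a quick feel of the gap from the Python side (my own toy timing, not the article's benchmark), compare a pure-Python loop against its C-backed NumPy equivalent:

```python
# Quick illustration (not the article's benchmark): a tight numeric loop
# in pure Python vs. NumPy's vectorized, C-backed equivalent.
import timeit
import numpy as np

def py_sum(n: int) -> float:
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

def np_sum(n: int) -> float:
    return float((np.arange(n) * 0.5).sum())

n = 1_000_000
print("pure Python:", timeit.timeit(lambda: py_sum(n), number=10))
print("NumPy (C):  ", timeit.timeit(lambda: np_sum(n), number=10))
```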

7. Two-step Classification using Recasted Data in Low Resource Settings

An NLP model’s ability to reason should be independent of language. Previous works utilize Natural Language Inference (NLI) to understand the reasoning ability of models, mostly focusing on high-resource languages like English. To address the scarcity of data in low-resource languages such as Hindi, the authors use data recasting to create four NLI datasets from four existing Hindi text classification datasets. Through experiments, they show that the recasted datasets are devoid of statistical irregularities and spurious patterns.

https://shagunuppal.github.io/publication/Hindi_NLI/
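
A hedged sketch of what “recasting” a classification example into NLI pairs can look like; the hypothesis template and label names here are my assumptions, not the exact ones from the paper:

```python
# Hypothetical illustration of "recasting" a classification example into
# NLI premise/hypothesis pairs; the template and label names are
# assumptions, not the exact ones used by the authors.
from typing import Dict, List

def recast_to_nli(text: str, gold_label: str,
                  all_labels: List[str]) -> List[Dict[str, str]]:
    """Turn one classification example into len(all_labels) NLI pairs."""
    pairs = []
    for label in all_labels:
        hypothesis = f"This text is about {label}."   # assumed template
        nli_label = "entailed" if label == gold_label else "not entailed"
        pairs.append({"premise": text, "hypothesis": hypothesis,
                      "label": nli_label})
    return pairs

# Example with a placeholder Hindi news-topic instance; labels shown in
# English for readability.
for p in recast_to_nli("<Hindi article text>", "sports",
                       ["sports", "politics", "entertainment"]):
    print(p["hypothesis"], "->", p["label"])
```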

8. Using AutoML for Time Series Forecasting

Google has introduced a scalable end-to-end AutoML solution for time series forecasting, which meets three key criteria (a toy sketch of the input/output contract appears after the link below):

  • Fully automated: The solution takes in data as input, and produces a servable TensorFlow model as output with no human intervention.
  • Generic: The solution works for most time series forecasting tasks and automatically searches for the best model configuration for each task.
  • High-quality: The produced models have competitive quality compared to those manually crafted for specific tasks.

http://ai.googleblog.com/2020/12/using-automl-for-time-series-forecasting.html
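
As a toy stand-in for that contract (this is not Google’s AutoML system, just the shape of its interface: raw series in, servable TensorFlow model out):

```python
# Toy stand-in for the "data in, servable TensorFlow model out" contract.
# NOT Google's AutoML system; just illustrates the interface shape.
import numpy as np
import tensorflow as tf

# Windowed toy series: predict the next value from the previous 12.
series = np.sin(np.arange(500, dtype="float32") / 10.0)
window = 12
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(window,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

# Export a servable SavedModel, the same artifact type the AutoML
# pipeline is described as producing.
model.save("forecast_savedmodel")
```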

9. NLP for Healthcare in the Absence of a Healthcare Dataset

10. In Search of Best Practices for NLP Projects – Ivan Bilan

This video explores best practices for building NLP and ML projects from the ground up, all the way to production.

11. Train generator

This is a web app that generates template code for machine learning. traingenerator offers multiple options for preprocessing, model setup, training, and visualization (using TensorBoard or comet.ml), and exports to .py, Jupyter Notebook, or Google Colab.

https://github.com/jrieke/traingenerator

12. Chess2Vec — Map of Chess Moves

In this interesting post, the author analyzes which chess moves are “close” to each other, in the sense that they often occur in similar situations in games: if two moves frequently occur before or after the same moves, they are similar in a certain sense.

https://towardsdatascience.com/chess2vec-map-of-chess-moves-712906da4de9
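
The title suggests a word2vec-style treatment of moves. Here is a sketch of that analogy using gensim, treating each game’s move list as a “sentence” (not necessarily the author’s exact pipeline):

```python
# Sketch of the word2vec analogy the title suggests: treat each game's
# move sequence as a "sentence" and each move (e.g. "e4") as a "word".
# Not necessarily the author's exact method.
from gensim.models import Word2Vec

games = [
    ["e4", "e5", "Nf3", "Nc6", "Bb5"],   # Ruy Lopez opening
    ["d4", "d5", "c4", "e6", "Nc3"],     # Queen's Gambit Declined
    ["e4", "c5", "Nf3", "d6", "d4"],     # Open Sicilian
]

# gensim >= 4.0 uses `vector_size`; older versions call this `size`.
model = Word2Vec(sentences=games, vector_size=16, window=3,
                 min_count=1, sg=1, epochs=50)

# Moves that co-occur in similar contexts end up with similar vectors.
print(model.wv.most_similar("e4", topn=3))
```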

13. Voice Separation with an Unknown Number of Multiple Speakers (ICML 2020)

Facebook Research presents a new method for separating a mixed audio sequence in which multiple voices speak simultaneously. The method employs gated neural networks that are trained to separate the voices over multiple processing steps while keeping the speaker assigned to each output channel fixed. A different model is trained for every possible number of speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample.

https://github.com/facebookresearch/svoice

https://arxiv.org/pdf/2003.01531.pdf
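
A short sketch of the selection logic as described: run the model trained for the largest speaker count and treat near-silent output channels as absent speakers. `separate_max` below is a hypothetical stand-in for the largest svoice model:

```python
# Sketch of the speaker-count selection described above. `separate_max`
# is a hypothetical stand-in for the model trained on the most speakers.
import torch

def estimate_num_speakers(mixture: torch.Tensor, separate_max,
                          energy_threshold: float = 1e-3) -> int:
    """mixture: (samples,) mono waveform; separate_max(mixture) -> (k_max, samples)."""
    channels = separate_max(mixture)          # separate into k_max channels
    energies = channels.pow(2).mean(dim=-1)   # per-channel mean energy
    active = int((energies > energy_threshold).sum())
    return active  # then route the mixture to the model trained for this count
```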

14. Magenta: Empowering Creative Agency with Machine Learning

https://slideslive.com/38938169/magenta-empowering-creative-agency-with-machine-learning?s=09

15. Simplified model lifecycle management with MLOps on Google Cloud

https://cloudonair.withgoogle.com/events/mlops-google-cloud

Papers


16. On Generating Extended Summaries of Long Documents

Prior work in document summarization has mainly focused on generating short summaries of a document. This paper exploits the hierarchical structure of documents and incorporates it into an extractive summarization model through a multi-task learning approach.

https://arxiv.org/abs/2012.14136
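
A rough sketch of the multi-task setup this implies: a shared sentence encoder with an extraction head and an auxiliary section-prediction head trained under a joint loss. The architecture details below are my assumptions, not the paper’s exact model:

```python
# Assumed multi-task sketch: shared encoder, one head scoring sentences
# for extraction, one predicting each sentence's section; joint loss.
import torch
import torch.nn as nn

class MultiTaskExtractor(nn.Module):
    def __init__(self, hidden=256, num_sections=8):
        super().__init__()
        self.encoder = nn.LSTM(input_size=768, hidden_size=hidden,
                               batch_first=True, bidirectional=True)
        self.extract_head = nn.Linear(2 * hidden, 1)             # keep sentence?
        self.section_head = nn.Linear(2 * hidden, num_sections)  # which section?

    def forward(self, sent_embeddings):       # (batch, num_sents, 768)
        h, _ = self.encoder(sent_embeddings)
        return self.extract_head(h).squeeze(-1), self.section_head(h)

def joint_loss(ext_logits, sec_logits, ext_labels, sec_labels, alpha=0.5):
    ext = nn.functional.binary_cross_entropy_with_logits(ext_logits, ext_labels)
    sec = nn.functional.cross_entropy(sec_logits.transpose(1, 2), sec_labels)
    return ext + alpha * sec  # auxiliary section task regularizes extraction
```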

17. EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP

The hefty computational and memory demands of Transformer-based language models such as BERT make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. This paper presents EdgeBERT, an in-depth and principled algorithm and hardware design methodology to achieve minimal latency and energy consumption for multi-task NLP inference.

https://arxiv.org/abs/2011.14203
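
One of the latency-saving ingredients the paper builds on is entropy-based early exit. Here is a simplified sketch of that idea alone (not the full EdgeBERT system, which also spans hardware-level techniques):

```python
# Simplified sketch of entropy-based early exit: each transformer layer
# gets its own classifier, and inference stops as soon as a layer's
# prediction is confident (low entropy). Not the paper's full system.
import torch

def entropy(logits):
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def early_exit_forward(layers, exit_heads, hidden, threshold=0.2):
    """layers / exit_heads: lists of modules; hidden: (batch, seq, dim)."""
    for layer, head in zip(layers, exit_heads):
        hidden = layer(hidden)
        logits = head(hidden[:, 0])          # classify from [CLS] position
        if entropy(logits).mean() < threshold:
            return logits                    # confident: skip remaining layers
    return logits                            # fell through: full-depth output
```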

18. Domain-specific BERT representations for Named Entity Recognition

The vocabulary used in the medical field contains many tokens specific to the domain, such as the names of diseases, devices, organisms, and medicines, which makes it difficult for a vanilla BERT model to create contextualized embeddings. This paper describes a system for named entity tagging based on BioBERT.

https://arxiv.org/abs/2012.11145
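
A minimal sketch of wiring up a BioBERT checkpoint for token classification with Hugging Face Transformers. The checkpoint name is the public `dmis-lab/biobert-v1.1` encoder; a real system would first fine-tune the token-classification head on a biomedical NER dataset rather than use it off the shelf:

```python
# Sketch: domain-specific NER with a public BioBERT checkpoint. The
# classification head here is untrained; fine-tune it on a biomedical
# NER dataset before use. num_labels is a placeholder tag-set size.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          pipeline)

model_name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name,
                                                        num_labels=5)

ner = pipeline("ner", model=model, tokenizer=tokenizer)
print(ner("Metformin is commonly prescribed for type 2 diabetes."))
```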

19. Transformer protein language models are unsupervised structure learners

https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1
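
The paper shows that the attention maps of large protein language models encode residue–residue contacts without structural supervision. A sketch using the public fair-esm package, which exposes a contact-prediction head of exactly this kind (API per the facebookresearch/esm repo):

```python
# Sketch of attention-based contact prediction with the fair-esm package:
# the ESM-1b model's attention maps, combined by a small learned head,
# yield residue-residue contact probabilities.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1",
         "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, return_contacts=True)
contacts = out["contacts"]   # (batch, seq_len, seq_len) contact probabilities
print(contacts.shape)
```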

Courses / Resources


20. The Best MIT Online Resources for You to Learn AI and Machine Learning for Free

https://medium.com/swlh/the-best-mit-online-resources-for-you-to-learn-ai-and-machine-learning-for-free-d3ba1e50f436

21. Deep Learning Lectures – Sebastian Raschka

https://sebastianraschka.com/resources/dl-lectures.html

22. Yann LeCun’s Deep Learning Course at CDS

23. Best NLP competitions on Kaggle (to learn from)

24. Awesome Fraud Detection Papers

https://github.com/benedekrozemberczki/awesome-fraud-detection-papers

25. Awesome Decision Tree Research Papers

https://github.com/benedekrozemberczki/awesome-decision-tree-papers


That’s it!!

Let me know if I missed anything or if there’s anything you think should be included in a future post.