Neural Architecture Search for Natural Language

12 minute read

What is NAS?

Deep learning has made remarkable progress over the years on a number of tasks, including speech and image recognition, machine translation, and automated driving. An important part of this progress is due to novel neural architectures. The architectures currently in use are mostly designed manually, which is time-consuming, resource-intensive, and error-prone. To mitigate this issue, Neural Architecture Search (NAS) was introduced. NAS automates the design of artificial neural networks and is a subfield of AutoML. NAS methods can be categorized by their search space, search strategy, and performance estimation strategy. Notably, NAS overlaps significantly with hyperparameter optimization and meta-learning.

NAS Overview

As mentioned above, the methods used for NAS can be categorized in the following three dimensions:

  1. Search Space: It determines which architectures can be represented. Prior knowledge about architectures suited to a given task can be used to restrict the search space, which simplifies the search and reduces computation time and effort. A basic example of a search space is the chain-structured neural network, depicted in the image below:

    Chain-Structured Neural Network

    The search space is further characterized by three parameters:

    a. The maximum number of layers it can have.
    b. The kind of operation each layer can perform, for example convolution or pooling.
    c. The hyperparameters of each operation, such as the kernel size, the number of filters, and the strides.

  2. Search Strategy: The search strategy determines how the search space is explored. Strategies include reinforcement learning, random search, gradient-based methods, and evolutionary algorithms (EA). Early EA-based methods evolved both the structure and the parameters of the network; more recent methods use EA only to search over structures and rely on SGD (Stochastic Gradient Descent) to learn the parameters. Reinforcement learning was later adopted as a search strategy for NAS as well. In this blog, we’ll explore the methods used so far and discuss what the future of NAS looks like, especially for NLP.

  3. Estimation Strategy: The goal of NAS is to find efficient architectures that achieve the highest possible performance on unseen data; the estimation strategy is how that performance is measured. The conventional method is to train each candidate on a training set and evaluate it on a validation set. However, this is extremely computationally expensive, as it requires a large number of architectures to be trained and tested. Recent approaches focus on reducing the cost of these estimates.
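The three dimensions above can be sketched together in a few lines of Python. This is only a toy illustration under stated assumptions: the operation and hyperparameter choices are hypothetical, `estimate_performance` is a made-up stand-in for "train on the training split, score on the validation split", and random search is used simply because it is the easiest search strategy to show.

```python
import random

# Hypothetical, illustrative choices; not taken from any specific paper.
MAX_LAYERS = 6
OPERATIONS = ("conv", "pool")
KERNEL_SIZES = (1, 3, 5)
NUM_FILTERS = (16, 32, 64)
STRIDES = (1, 2)


def sample_architecture(rng):
    """Search space: draw one chain-structured architecture.

    Each layer is a tuple (op, kernel_size, filters, stride);
    pooling layers have no filters, so their filter count is 0.
    """
    layers = []
    for _ in range(rng.randint(1, MAX_LAYERS)):
        op = rng.choice(OPERATIONS)
        filters = rng.choice(NUM_FILTERS) if op == "conv" else 0
        layers.append((op, rng.choice(KERNEL_SIZES), filters, rng.choice(STRIDES)))
    return tuple(layers)


def search_space_size():
    """Number of distinct architectures this space can represent."""
    conv_options = len(KERNEL_SIZES) * len(NUM_FILTERS) * len(STRIDES)
    pool_options = len(KERNEL_SIZES) * len(STRIDES)
    per_layer = conv_options + pool_options
    return sum(per_layer ** depth for depth in range(1, MAX_LAYERS + 1))


def estimate_performance(arch):
    """Estimation strategy: toy stand-in for validation accuracy.

    A real estimator would train the network; this one just prefers
    convolution-heavy networks with about four layers.
    """
    conv_share = sum(layer[0] == "conv" for layer in arch) / len(arch)
    depth_term = 1.0 - abs(len(arch) - 4) / MAX_LAYERS
    return 0.5 * conv_share + 0.5 * depth_term


def random_search(num_trials, seed=0):
    """Search strategy: keep the best of num_trials random samples."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


if __name__ == "__main__":
    arch, score = random_search(200)
    print(search_space_size(), len(arch), round(score, 3))
```

Even this tiny space contains roughly 2 x 10^8 architectures, which is why the cost of each performance estimate dominates the overall search cost.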

State Of The Art

Having gone through multiple research papers on NAS, we’d like to discuss some of the interesting ones to lay the groundwork for further research. There has been extensive research on NAS for image classification, yet not much has been done on NAS for NLP. Each of the papers mentioned below uses a different search strategy, which is what makes it unique.

1. Image Classification


    In this paper, the authors propose using a recurrent neural network (RNN) as a controller that generates architectures, and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on the validation set. On CIFAR-10, the paper reports that NAS with RL is on par with the best manually designed architectures known at the time, achieving a test error rate of 3.65 percent.

    The authors did not restrict this approach to image classification datasets; they also used it to find a novel recurrent cell architecture on the Penn Treebank dataset, a widely used NLP dataset that serves here as a language modeling benchmark. The resulting test perplexity is 3.6 points better than the previous state of the art.

    An overview of Neural Architectural Search
    • Methodology
      • The controller to generate architectural hyperparameters is implemented as an RNN.
      • Architecture generation terminates when the number of layers exceeds a limit that is set beforehand based on the task.
      • The generated neural network is then built and trained, and its accuracy is recorded.
      • The parameters of the controller RNN are then optimized to maximize the expected accuracy.

      The search space is adapted to the dataset, and an RNN controller is used in each case. The biggest limitation of this approach is its computational cost: the search keeps 800 networks training concurrently on 800 GPUs at any given time, which makes the results very difficult to reproduce.
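The controller loop described in this methodology can be sketched with a plain policy-gradient (REINFORCE) update. The sketch below makes several simplifying assumptions that are not from the paper: the controller is a table of independent softmax logits per layer rather than an RNN, the only decision is a hypothetical filter count per layer, and `toy_reward` replaces the expensive step of building and training the child network.

```python
import math
import random

CHOICES = (16, 32, 64)  # hypothetical filter counts the controller picks from
NUM_LAYERS = 3          # depth limit, fixed beforehand as in the text


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(arch):
    """Stand-in for the validation accuracy of the trained child network."""
    return sum(arch) / (max(CHOICES) * len(arch))


def train_controller(steps=500, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline; returns per-layer probabilities."""
    rng = random.Random(seed)
    logits = [[0.0] * len(CHOICES) for _ in range(NUM_LAYERS)]
    baseline = 0.0
    for _ in range(steps):
        # 1. The controller samples an architecture, one decision per layer.
        probs = [softmax(row) for row in logits]
        actions = [rng.choices(range(len(CHOICES)), weights=p)[0] for p in probs]
        # 2. The child network is "trained" and its accuracy recorded.
        reward = toy_reward([CHOICES[a] for a in actions])
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # 3. The controller parameters move along advantage * grad log-prob.
        for layer, action in enumerate(actions):
            for k in range(len(CHOICES)):
                grad_log_prob = (1.0 if k == action else 0.0) - probs[layer][k]
                logits[layer][k] += lr * advantage * grad_log_prob
    return [softmax(row) for row in logits]


if __name__ == "__main__":
    for layer_probs in train_controller():
        print([round(p, 3) for p in layer_probs])
```

With this toy reward the controller typically shifts probability toward the largest filter count. In the real setting, every call to the reward function is a full training run, which is where the GPU cost comes from.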

    • Dataset Used

      The dataset used for this method was CIFAR-10, which consists of 60,000 32x32 color images in 10 classes, with 6,000 images in each class. The 60,000 images are split into 50,000 for training and 10,000 for testing.

    • Hardware Requirements

      As noted above, the search keeps 800 networks training concurrently on 800 GPUs.

    The first paper used a basic RL approach for NAS. The next one applies parameter sharing within RL, another popular way to improve efficiency while reducing computation time.


    In this paper, the authors propose Efficient Neural Architecture Search (ENAS), a fast and relatively inexpensive approach to automated model design. The steps of ENAS are summarized below:

    1. The core idea of ENAS is to have a controller discover neural network architectures by searching for an optimal subgraph within a large computational graph.
    2. The controller is trained with a policy gradient method to select the subgraph that maximizes the expected reward on the validation set.
    3. The chosen model is then trained.
    4. Sharing parameters among the child networks yields strong empirical performance while using far fewer GPU hours, reducing the overall time and memory cost of the search.
    5. The controller is again an RNN; it decides which edges are activated and which computation is performed at each node in the graph.
    • Methodology

      In ENAS, the whole search space can be represented as a single Directed Acyclic Graph (DAG). Any individual architecture is realized by taking a subgraph of this DAG.

      ENAS search space represented as DAG

      Here the nodes represent the local computations and the edges represent the flow of information. Such a design allows the parameters stored in each node to be shared among all the other child models.
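The parameter-sharing idea can be made concrete with a small sketch. It assumes a hypothetical four-node DAG and stores one scalar weight per possible edge, wrapped in a list so that it is shared by reference. Real ENAS weights are matrices and its controller is an RNN; neither is modelled here, and children are simply sampled at random.

```python
import random

NUM_NODES = 4  # node 0 is the input; hypothetical size
OPS = ("tanh", "relu", "sigmoid", "identity")


def make_shared_weights():
    """One weight per possible edge (src -> dst, with src < dst) of the full DAG."""
    return {(src, dst): [0.0] for dst in range(1, NUM_NODES) for src in range(dst)}


def sample_child(rng):
    """A child model: for each node, the earlier node feeding it and the op it applies."""
    return [(rng.randrange(dst), rng.choice(OPS)) for dst in range(1, NUM_NODES)]


def child_weights(child, shared):
    """Look up (not copy) the shared weights on the edges this child activates."""
    return [shared[(src, dst)] for dst, (src, _op) in enumerate(child, start=1)]


shared = make_shared_weights()
rng = random.Random(0)
child_a, child_b = sample_child(rng), sample_child(rng)

# Node 1 can only be fed by node 0, so every child activates edge 0 -> 1.
# "Training" child_a therefore changes the weight that child_b will use.
child_weights(child_a, shared)[0][0] += 0.5
print(child_weights(child_b, shared)[0][0])
```

Because a child only looks up weights from the shared table, evaluating a new subgraph does not require training it from scratch, which is the source of the GPU-hour savings mentioned above.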

    • Dataset Used

      CIFAR-10 and Penn Treebank Dataset.

    • Results

      • On CIFAR-10, the architecture found by ENAS achieved a test error of 2.89 percent, which is on par with the SOTA error rate of 2.65 percent.
      • On Penn Treebank, the architecture found by ENAS achieved a perplexity of 55.8, which established a new SOTA at the time.

    NOTE: The methods used in the papers above are efficient and comparable to the state of the art, but they are still too computationally heavy to reproduce easily. The paper below presents a fresh approach to this problem: it eliminates the training process during search and saves a lot of computation time.


    Here, the authors build on the observation that a network’s trained accuracy can be partially predicted from its state at initialization, before the model is trained at all.

    Given a neural network with rectified linear units, we can, at each unit in each layer, identify a binary indicator as to whether the unit is inactive (the value is negative and hence is multiplied by zero) or active (in which case its value is multiplied by one). Fixing these indicator variables, it is well known that the network is now locally defined by a linear operator (Hanin & Rolnick, 2019). This operator is obtained by multiplying the linear maps at each layer interspersed with the binary rectification units.

    • Methodology

      The linear maps of the network are uniquely identified by a binary code corresponding to the activation pattern of the rectified linear units. The Hamming distance between these binary codes can be used to define a kernel matrix (denoted K_H) that is distinctive for networks that perform well.
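A minimal sketch of this idea follows, assuming a single hidden ReLU layer without biases and scoring the network by the log-determinant of K_H, where each kernel entry is the number of units minus the Hamming distance between two codes. The helper names are hypothetical, and the log-determinant is computed with a small pure-Python Gaussian elimination only to keep the example dependency-free.

```python
import math
import random


def relu_codes(inputs, weights):
    """Binary activation pattern of one hidden ReLU layer for each input."""
    codes = []
    for x in inputs:
        pre_activations = [sum(w * v for w, v in zip(row, x)) for row in weights]
        codes.append([1 if p > 0 else 0 for p in pre_activations])
    return codes


def hamming_kernel(codes):
    """K_H[i][j] = number of units minus the Hamming distance between codes i and j."""
    n_units = len(codes[0])
    return [[n_units - sum(a != b for a, b in zip(ci, cj)) for cj in codes]
            for ci in codes]


def log_abs_det(matrix):
    """log|det(matrix)| via Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return float("-inf")  # singular: some inputs share a code
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total


def naswot_style_score(inputs, weights):
    """Higher means the untrained network separates the inputs more distinctly."""
    return log_abs_det(hamming_kernel(relu_codes(inputs, weights)))


if __name__ == "__main__":
    rng = random.Random(0)
    batch = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
    layer = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]
    print(naswot_style_score(batch, layer))
```

Two inputs with identical codes produce identical rows in K_H, so the determinant collapses to zero and the score to negative infinity; networks that give each input its own activation pattern score higher.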

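    • Dataset Used

      The approach is evaluated on the NAS-Bench-101, NAS-Bench-201, NATS-Bench, and Network Design Spaces benchmarks.
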
    • Results

      The paper proposes a search algorithm, NASWOT (Neural Architecture Search Without Training), that can navigate these search spaces in a matter of seconds. It relies on simple, intuitive observations made on initialised neural networks and challenges more expensive black-box methods that involve training. Future applications of this approach to architecture search may allow NAS to specialise architectures over multiple tasks and devices without the need for long training stages.

      The authors examine the overlap of activations between datapoints in untrained networks and motivate how this gives a measure that is usefully indicative of a network’s trained performance. They incorporate this measure into a simple algorithm that searches for powerful networks without any training, in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101, NAS-Bench-201, NATS-Bench, and Network Design Spaces. The approach can be readily combined with more expensive search methods; the authors examine a simple adaptation of regularised evolutionary search. They have also released code for reproducing these experiments.

2. Natural Language Processing


    Moving away from the extensive research that’s been devoted to NAS for Image classification, this paper establishes the benchmark for NAS in the domain of NLP. It was among the initial papers we encountered on NAS for NLP, significantly influencing our subsequent research. The authors have meticulously designed the search space for recurrent neural networks, encompassing essential modifications like LSTM and GRU cells. They conducted comprehensive experiments on text datasets, training over 14k architectures within this framework.

    • Notable Contributions :

      • They present the first RNN-derived NAS benchmark designed for NLP tasks. They trained over 14k architectures within the designed search space on the language modeling task and conducted both intrinsic and extrinsic evaluations.

      • Intrinsic evaluation assesses the quality of the learned representations directly; in this paper, the authors do so by evaluating word similarity using static word embeddings. Extrinsic evaluation, on the other hand, assesses them indirectly through performance on specific tasks; in this study, the authors measure performance on downstream tasks such as the GLUE benchmark.

      • They introduce a framework for benchmarking and compare different NAS algorithms within it. They have also released all the trained architectures, allowing architecture comparison in further research.
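To illustrate what a word-similarity style of intrinsic evaluation can look like, here is a small sketch. It assumes the common recipe of comparing cosine similarities between embeddings against human similarity ratings with Spearman rank correlation; the tiny embeddings and ratings are invented for the example and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Rank of each value (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))


# Hypothetical static embeddings and human ratings, for illustration only.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "car": [0.1, 0.9, 0.2],
    "bus": [0.0, 0.8, 0.4],
}
rated_pairs = [("cat", "dog", 8.5), ("car", "bus", 7.9), ("cat", "car", 2.0)]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
print(spearman(model_scores, human_scores))
```

A correlation close to 1 means the embeddings order word pairs the way the human ratings do; extrinsic evaluation would instead plug the embeddings into a downstream task and measure task performance.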

    • Datasets

      Penn Treebank (PTB) and WikiText-2 datasets.

    • Hardware Requirement

      The Zhores HPC cluster, using Tesla V100-SXM2 GPUs with 16 GB of memory each.


    TE-NAS ranks architectures without training them, using two training-free and label-free indicators of architecture quality: the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space.

    The paper uses these two theoretically inspired indicators to build a training-free NAS framework, TE-NAS, which eliminates a large portion of the search cost. The authors further introduce a pruning-based mechanism to boost search efficiency and to trade off more flexibly between trainability and expressivity.
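The second indicator, the number of linear regions, can be approximated by counting the distinct ReLU activation patterns hit by sampled inputs. The sketch below assumes a small bias-free fully connected network and uniform input sampling; it is not the estimator used in the paper, and the NTK spectrum indicator is not covered here.

```python
import random


def activation_pattern(x, layers):
    """Forward one input through ReLU layers, recording which units fire."""
    pattern = []
    hidden = x
    for weights in layers:
        pre_activations = [sum(w * v for w, v in zip(row, hidden)) for row in weights]
        pattern.extend(p > 0 for p in pre_activations)
        hidden = [max(p, 0.0) for p in pre_activations]
    return tuple(pattern)


def count_linear_regions(layers, num_samples=2000, input_dim=2, seed=0):
    """Lower bound on the number of linear regions: distinct patterns seen."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(num_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
        seen.add(activation_pattern(x, layers))
    return len(seen)


def random_layers(widths, seed=0):
    """Gaussian-initialised weight matrices for consecutive layer widths."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(widths[:-1], widths[1:])]


if __name__ == "__main__":
    shallow = random_layers([2, 8])
    deeper = random_layers([2, 8, 8])
    print(count_linear_regions(shallow), count_linear_regions(deeper))
```

Each distinct pattern corresponds to one region of the input space on which the network is linear, so a larger count is read as higher expressivity. Since no labels and no training are involved, the indicator is cheap to compute for many candidate architectures.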

    • Dataset and Hardware Requirement

      In the NAS-Bench-201 and DARTS search spaces, TE-NAS discovers architectures with strong performance at a remarkably lower search cost than previous efforts. With a single 1080Ti, the search costs only 0.5 GPU hours on CIFAR-10 and 4 GPU hours on ImageNet, which the paper presents as a new record for ultra-efficient yet high-quality NAS.

3. Papers Worth Mentioning in the Domain of NLP


    Automatic Speech Recognition (ASR) has made tremendous progress in the recent past in reducing the word error rate. ASR models are trained on thousands of hours of high-quality speech data, so training is very computationally expensive and time-consuming. NAS has traditionally not been applied to training data as extensive as ASR data. What sets this paper apart is therefore NAS-Bench-ASR, the first NAS benchmark for ASR models, released by the authors.

    • About the Released Dataset:

      • It consists of 8,242 unique models trained on the TIMIT audio dataset for 3 different target epochs.
      • It includes runtime measurements of all the models on a diverse set of hardware platforms.
      • The authors show that good architectures identified in the search space on the TIMIT dataset can be transferred to the much larger LibriSpeech dataset as well.
    • Hardware and Computation Requirements:

      • 8,242 models trained with 3 different seeds and 3 different target epochs (5, 10, and 40) make a total of 8,242 x 3 x 3 = 74,178 models.
      • To train these roughly 74,000 models, they used Tesla 1080Ti and Jetson Nano.

      Along with common ASR training metrics such as PER and CTC loss, this paper also provides the number of parameters, FLOPs, and latency of running every model on the two hardware platforms mentioned above.


    This paper particularly caught our attention. We decided to replicate its results and plan to implement our own modifications in the future.

    This paper proposes a one-of-a-kind search space customized for text representation, called TextNAS. The authors argue that search spaces have received little attention compared to search algorithms, and that it was therefore crucial to design one.

    The architecture adopted combines RNN and CNN layers, as opposed to the conventional approach of using only CNNs for text classification.

    The contribution of this paper is threefold:

    1. Introducing a novel search space tailored for text representation.
    2. Introducing the search algorithms adopted in TextNAS.
    3. Describing the frameworks of two tasks: text classification and natural language inference.
    • Hardware Requirements

      Individual models are trained using a TensorFlow-based training pipeline running on a single GPU.

    • About TextNAS

      • The TextNAS search space consists of a mixture of convolutional, recurrent, pooling, and self-attention layers.
      • Given the search space, the TextNAS pipeline proceeds in three steps:

        1. The ENAS search algorithm is run on the search space, using the accuracy on validation data as the RL reward.

        2. Grid search is conducted on the derived architecture to find the best hyper-parameter setting on the validation set.

        3. The derived architecture is trained from scratch with the best hyper-parameters on the combination of training and validation data.
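Step 2 of this pipeline is an ordinary grid search, which can be sketched as below. The hyper-parameter names, the grid values, and the `toy_validation_score` function are hypothetical placeholders; in the actual pipeline each evaluation would train the derived architecture and score it on the validation set.

```python
import itertools


def grid_search(evaluate, grid):
    """Return the setting with the highest score, trying every combination."""
    names = list(grid)
    best_setting, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        score = evaluate(setting)
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score


def toy_validation_score(setting):
    """Stand-in for validation accuracy; peaks at lr=0.01 and dropout=0.5."""
    return -abs(setting["lr"] - 0.01) - abs(setting["dropout"] - 0.5)


if __name__ == "__main__":
    grid = {"lr": [0.1, 0.01, 0.001], "dropout": [0.1, 0.3, 0.5]}
    print(grid_search(toy_validation_score, grid))
```

The number of evaluations is the product of the grid sizes, so this step stays cheap only because it is run once, on the single architecture that the search step returned.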

    • Dataset Used

      The Stanford Sentiment Treebank (SST) dataset is used to evaluate the TextNAS pipeline.


Each of the aforementioned papers expanded our knowledge of state-of-the-art NAS techniques. We synthesized our learnings from each paper and focused on replicating the results presented in the TextNAS paper. The choice to replicate this paper stemmed from its low computational requirements and its highly optimized architecture involving RNN and CNN. The paper emphasizes algorithmic optimization while also carefully considering the search space, thereby enhancing overall performance. The entire process was completed in approximately 24 hours using a single Tesla P100 GPU. We believe that continued research in this domain has the potential to significantly reduce GPU computational hours and drive advancements in Natural Language Processing.

Adnan Ali

A B.Tech. student at DCLL IIIT Delhi.

Chavisha Arora

A B.Tech. student at DCLL IIIT Delhi.

