BERT is a powerful natural language processing model that handles a wide range of tasks, including named entity recognition, sentiment analysis, and text classification. It was pretrained with a strategy known as masked language modeling: words in the input sentences are randomly masked, and the model is trained to predict each masked word. By wrapping BERT in TensorFlow Serving, developers can serve it in memory-efficient, low-latency settings.

To expose the masked-word predictions, an extra layer is added on top of the final layer of the encoder stack, transforming the output from (batch_size, max_seq_length, hidden_units) to (batch_size, max_seq_length, vocab_size). The model is then saved in the SavedModel format using the Estimator class's export_savedmodel function. Once saved, the model can be hosted in a Docker container with TensorFlow Serving as the base image, and low-latency predictions can be obtained by sending REST API requests to the served model.
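The shape transformation performed by that extra layer can be sketched with plain NumPy: a dense projection maps each token's hidden vector to a score per vocabulary entry. The dimensions below are illustrative stand-ins (BERT-base uses hidden_units=768 and a vocabulary of roughly 30,000 WordPiece tokens), and the random arrays are placeholders for real encoder outputs and learned weights.

```python
import numpy as np

# Illustrative sizes; BERT-base uses hidden_units=768 and a ~30k vocabulary.
batch_size, max_seq_length, hidden_units, vocab_size = 2, 8, 16, 32

# Stand-in for the encoder's final-layer output: one hidden vector per token.
sequence_output = np.random.randn(batch_size, max_seq_length, hidden_units)

# The extra layer is a dense projection from hidden size to vocabulary size,
# so every token position receives a score for each vocabulary entry.
W = np.random.randn(hidden_units, vocab_size)
b = np.zeros(vocab_size)
logits = sequence_output @ W + b

print(logits.shape)  # (2, 8, 32)
```

Taking an argmax (or softmax) over the last axis of `logits` then yields the model's predicted word at each masked position.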
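Hosting the exported model might look like the following, using the official `tensorflow/serving` image. The host path and the model name `bert` are illustrative; `MODEL_NAME` must match the name used later in the REST request URL, and port 8501 is TensorFlow Serving's default REST port.

```shell
# Serve the exported SavedModel with TensorFlow Serving as the base image.
# source= should point at the directory produced by export_savedmodel.
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/export/dir,target=/models/bert \
  -e MODEL_NAME=bert -t tensorflow/serving
```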
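A prediction request could then be assembled as below. TensorFlow Serving's REST predict endpoint expects a JSON body with an `instances` list; the feature names (`input_ids`, `input_mask`, `segment_ids`) and the model name `bert` in the URL are assumptions and must match whatever serving signature the model was exported with. The example only builds the request body; the commented-out line shows how it would be sent to a live server.

```python
import json

# Assumes a model served under the name "bert" on the default REST port 8501.
url = "http://localhost:8501/v1/models/bert:predict"

# Hypothetical tokenized input: [CLS], [MASK], [SEP] token ids for BERT-base.
payload = {
    "instances": [
        {
            "input_ids": [101, 103, 102],
            "input_mask": [1, 1, 1],
            "segment_ids": [0, 0, 0],
        }
    ]
}

body = json.dumps(payload)
# With a running server: requests.post(url, data=body).json()["predictions"]
print(body)
```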