Applied NLP #16
Efficient Few-Shot Learning, Putting LLMs into production, PLUS Moneyball-for-everything
Hey readers,
I’m excited to share some of my learnings over the past week. Let’s get to it!
SetFit (Sentence Transformer Fine-tuning)
Links: GitHub, Paper, Blog post
One of the biggest challenges with building machine learning models is having little to no labeled data. In NLP, models like GPT-3 generate so much excitement because they can accomplish tasks like classifying text into categories with zero to a few examples (i.e. zero-shot and few-shot learning)!
While this is an accomplishment worth celebrating, telling GPT-3 what you want it to do via a free-text interface (aka “prompt engineering”) is still an art. An interface that behaves probabilistically will produce unexpected outputs (i.e. “hallucinations”) and, with them, frustration.
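To make that concrete, here’s roughly what few-shot classification by prompting looks like. The prompt wording and the `complete` helper are hypothetical placeholders for whichever completion API you call; the point is that the entire “interface” is free text, so small wording changes can shift the output.

```python
# Hypothetical few-shot classification prompt. `complete` stands in for
# whatever LLM completion API you use; it is not a real library function.
def build_prompt(review: str) -> str:
    return (
        "Classify the sentiment of each review as Positive or Negative.\n\n"
        "Review: The battery died after two days.\nSentiment: Negative\n\n"
        "Review: Setup took five minutes and it just works.\nSentiment: Positive\n\n"
        f"Review: {review}\nSentiment:"
    )

def classify(review: str, complete) -> str:
    # The model's free-text continuation *is* the prediction, which is why
    # prompt changes (or sampling randomness) can produce surprising labels.
    return complete(build_prompt(review)).strip()
```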
I think this tweet sums up the issues with prompt engineering pretty well.
So imagine my excitement when I learned about SetFit, a library that allows you to build accurate text classifiers with only a small amount of labeled data. No need for prompts.
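As a quick taste, here’s a sketch of what fine-tuning looks like, loosely based on the quickstart in the SetFit README (the SST-2 dataset and the 8-examples-per-class sampling are my choices for illustration, and the exact trainer arguments may differ across releases):

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer, sample_dataset

# Simulate the few-shot regime: 8 labeled examples per class from SST-2.
dataset = load_dataset("sst2")
train_ds = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_ds = dataset["validation"]

# Start from a pretrained Sentence Transformer checkpoint.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # number of text pairs generated for contrastive fine-tuning
    column_mapping={"sentence": "text", "label": "label"},
)
trainer.train()
print(trainer.evaluate())

# Inference is a plain call on raw strings.
preds = model(["i loved the spiderman movie!", "pizza on a sunday evening is the best!!"])
```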
Below is a table from the blog post that shows how well SetFit performs on the RAFT (Real-world Annotated Few-shot Tasks) benchmark compared to other models designed to classify text with very little labeled data. Notice that SetFit (MPNet) outperforms GPT-3 despite having 1000x fewer model parameters.
This is significant because fewer model parameters usually mean faster and cheaper model training and deployment.
I’m looking forward to digging more into SetFit!
Good Reads from the Past Week
WebGPT
Very neat approach to building a web-searching experience. The app does the following steps:
Triggers a web search and gets the first 3 results
Scrapes the content from the web pages of those results
Summarizes the content of each web page
Synthesizes the content into a final answer with references. This part is very cool. While GPT-3 is very good at step 3 out of the box, it needed to learn how to do the synthesis, and it only took 2 examples! A rough sketch of the pipeline is below.
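Here’s a minimal sketch of that search → scrape → summarize → synthesize flow. The `search_web`, `extract_text`, and `llm_complete` helpers are hypothetical placeholders (the thread doesn’t spell out a stack), so treat this as an outline rather than the app’s actual implementation:

```python
from typing import Callable, List

def answer_with_references(
    question: str,
    search_web: Callable[[str, int], List[dict]],  # returns [{"url": ..., "html": ...}]
    extract_text: Callable[[str], str],            # strips HTML down to readable text
    llm_complete: Callable[[str], str],            # wraps whatever completion API you use
) -> str:
    # 1. Trigger a web search and keep the first 3 results.
    results = search_web(question, 3)

    # 2-3. Scrape each page and summarize it individually.
    summaries = []
    for r in results:
        text = extract_text(r["html"])
        summary = llm_complete(f"Summarize the following page:\n\n{text[:4000]}\n\nSummary:")
        summaries.append((r["url"], summary))

    # 4. Synthesize the per-page summaries into one answer with references.
    context = "\n\n".join(f"[{i + 1}] {url}\n{s}" for i, (url, s) in enumerate(summaries))
    prompt = (
        "Using the numbered sources below, answer the question and cite sources like [1].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```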
You can learn more from this thread.
Productizing Large Language Models
This was an insightful read on some of the challenges of putting Large Language Models into production and having them perform consistently. Replit uses them for their Ghostwriter feature, which can autogenerate code similar to GitHub Copilot!
A lot of the challenges they outline in the blog post fall into these categories:
Creativity of the LLMs - this is tuned with the temperature parameter, which controls how much randomness the model uses when sampling the next token. A temperature of 0 makes the LLM more predictable and less creative. A high temperature makes the model more creative but increases the chances of giving a nonsensical answer. Part of putting LLMs into production is tuning this parameter, similar to hyperparameter tuning in standard ML training loops (see the sampling sketch after this list).
Repetition in the code completions - This can be controlled with a frequency penalty parameter. The higher the frequency penalty, the less likely the model is to generate words it has already generated. This prevents repetition loops, but penalizing too heavily hurts use cases where some repetition makes sense, like code or poetry.
Performance - LLMs contain billions of parameters, which makes them slow by default. The architectures of these models can be optimized for faster code completion. Some examples are FasterTransformer and using smaller transformer models with fewer parameters, such as DistilBERT, which speeds up code completion with a slight tradeoff in quality.
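To make the first two knobs concrete, here’s a toy next-token sampler (my own illustration, not Replit’s code) showing how temperature and a frequency penalty reshape the output distribution:

```python
import numpy as np

def sample_next_token(logits, generated_ids, temperature=0.8, frequency_penalty=0.5):
    """Toy next-token sampler: shows how temperature and a frequency penalty
    shape an LLM's output distribution (illustration only, not a real API)."""
    logits = np.asarray(logits, dtype=float)

    # Frequency penalty: subtract a penalty proportional to how often each
    # token id has already been generated, discouraging repetition loops.
    counts = np.bincount(np.asarray(generated_ids, dtype=int), minlength=len(logits))
    logits = logits - frequency_penalty * counts

    # Temperature: 0 -> greedy/deterministic; higher values flatten the
    # distribution, making "creative" (and occasionally nonsensical) picks likelier.
    if temperature == 0:
        return int(np.argmax(logits))
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# Tiny 4-token vocabulary demo: token 2 has already appeared three times,
# so the frequency penalty steers the sampler away from repeating it.
print(sample_next_token([1.0, 2.0, 3.0, 0.5], [2, 2, 2], temperature=0.7, frequency_penalty=0.8))
```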
An SPA Alternative
This is an interesting essay that argues that the web development pendulum has swung too far towards JavaScript and that SPAs are complex monstrosities that don’t offer enough of a performance gain.
I’ll admit, I’m a fan of the essay. I learned React on the job and was never a fan of it. It has a powerful community and a lot of pre-built components, but I never really enjoyed building frontends with it. It always felt like an uphill battle. I missed Django.
I’ve been using the htmx library for a new SaaS product I’m building, and now I feel like I can get the benefits of server-side rendered HTML (via Jinja templates) while also building the interactivity I need without having the frontend code revolve around JS. A sprinkle of it here and there will do.
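For anyone curious what that pattern looks like, here’s a minimal sketch. I’m assuming Flask purely for illustration (the post only mentions Jinja templates); the real htmx attributes `hx-get` and `hx-target` handle the swap:

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# The page is plain server-rendered HTML; htmx adds interactivity via attributes.
PAGE = """
<script src="https://unpkg.com/htmx.org"></script>
<button hx-get="/greet" hx-target="#greeting">Say hi</button>
<div id="greeting"></div>
"""

@app.route("/")
def index():
    return render_template_string(PAGE)

@app.route("/greet")
def greet():
    # htmx swaps this server-rendered fragment into the #greeting div.
    return "<p>Hello from the server. No client-side framework required.</p>"

if __name__ == "__main__":
    app.run(debug=True)
```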
What Moneyball-for-Everything Has Done to American Culture
When I read this essay it reminded me of recommendation systems that cause users to be trapped in bubbles. The taste and preferences of users become a “solved” or “finite” problem that moves some metric up and to the right.
While that might feel like a win on paper, you sacrifice serendipity and surprise. As Derek Thompson puts it, “Its genius dulls the rough edges of entertainment”.
Yes, the number of home runs and strikeouts is increasing because the analytics say those lead to wins. But the overall product “feels” worse than those top-line metrics suggest, and if I were in charge of the MLB I think I’d be worried.
As a big Yankees fan (the Yankees are the worst offenders of this, btw), I rarely see exciting rallies where 3 singles or doubles get strung together to tie the game. In crucial moments every player is always trying to swing for the fences. Rallies like the one in the video below seem to happen much less often.
It feels like the spectrum of outcomes in baseball has been reduced to walks and home runs on offense and a new pitcher every inning, which in my opinion makes the game slower and much more boring.
Now, this is all just opinion and I’m sure I’m coming off as a curmudgeon.
The bottom line is that users or fans care about how they feel about the product or the game, not about the metrics you’re optimizing for. Teams obviously care about wins first and foremost, but from the MLB’s perspective, the game needs to be played beautifully to retain fans.
I think the broader lesson I learned from reading this essay is that anytime you game some metric like click-throughs or walks, you need to use your intuition and a top-line metric to make sure the overall product doesn’t end up feeling worse.