If you're anything like me, you've probably been fascinated by the world of data science and all its possibilities. And it’s safe to say, one of the hottest topics right now is Large Language Models (LLMs).
But here's the thing: for all the buzz around LLMs, there's also a lot of misunderstanding and misconceptions. It's nobody's fault; LLMs may appear straightforward on the surface, yet understanding their true value involves a bit more digging.
With a strong passion for data science, my goal has always been to offer a distinct perspective on it, one that focuses on a practical blend of the "why" and the "how." Recently, I've been knee-deep in LLMs, learning a ton along the way, and today I'll be sharing some insights on a relationship that's been popping up a lot: LLMs and value.
This piece is going to be an extensive deep dive into LLMs and how they can deliver organizational value to enterprises, all from a technical perspective rather than a leadership perspective.
Understanding data and the inception of LLMs
You can’t and won’t understand LLMs if you don’t understand data at a very basic level. Data comes in a variety of forms. There's text, images, videos, streaming data that leans towards waveforms, and even geographical data. You can't discount any of them.
I believe that any advancements in development, particularly in data science, will inevitably revolve around these diverse types of data. That's the foundation.
Now, when it comes to the science part, it's about delving into these data types, studying them, and drawing conclusions. When I first ventured into the field as a data analyst or scientist, my first and primary interaction was with text data.
When someone enters this field, they tend to gravitate towards one form of data, and their entire perspective revolves around that specific type. For me, it was text.
Back then, when I began my career, machine learning (ML) was being overshadowed by deep learning, especially in the realm of text data. Enterprises and business professionals were content with a black-box solution that delivered accurate results.
Let's take sentiment analysis, for instance. You provide them with the output, and they're satisfied. The intricacies of how that output is generated don't matter to them.
This acceptance of black box solutions paved the way for large language models (LLMs). This shift in attitude made deep learning more palatable because otherwise, sticking with traditional machine learning or basic statistics would’ve sufficed.
Deep learning came to the forefront, particularly in handling text data, leading to the development of transformers and, eventually, the emergence of LLMs.
How big are LLMs today?
Before the advent of large language models (LLMs), just about a year and a half ago, we were grappling with classification tasks. Imagine a scenario where you simply need to sort data into either Bucket A or Bucket B, assigning them to specific categories. It sounds straightforward, right?
However, without LLMs, things weren't as simple. Especially when dealing with classification tasks in specialized domains like medical documents, you would need domain-specific data. Getting hold of this medical data isn't exactly a walk in the park. It often involves curating datasets from the clinical or medical industry, which can be quite a challenge.
In the past, we had to resort to gathering papers, converting them into datasets, and then training models on top of them. These models would need to grasp the nuances of fields like biomedicine and clinical industries to perform the classification accurately. It was a laborious process, to say the least.
Then came LLMs which changed the game entirely. We no longer need to spend hours curating datasets. Instead, we just need to check whether the LLM already possesses knowledge about the domain in question.
You see, LLMs are essentially a culmination of internet data. They've been trained on vast amounts of text data collected from the internet over the past six decades or so. Even a fraction of this data is incredibly valuable. LLMs have absorbed this wealth of information, making them a sort of embodiment of the internet itself.
“For me, as a data scientist, this shift was particularly fascinating. It meant that I no longer had to engage in task-specific, domain-specific data-gathering exercises. Instead, I could rely on this general-purpose machine to handle most of the heavy lifting.”
Think of it as having the entirety of the internet at your fingertips, condensed into a single machine. This not only streamlines processes for professionals like myself but also democratizes access to advanced language understanding for the general populace.
In a way, it's reminiscent of the phrase 'If you don't know something, Google it.' Now, it's poised to become 'If you don't know something, GPT it.'
Understanding ‘value’ in LLMs
Alright, let’s illustrate this with examples from two distinct industries, Recruitment and EdTech, to showcase the impact and value brought forth by large language models (LLMs).
The remarkable aspect about the value these models offer is that they transcend enterprise boundaries. Since LLMs are essentially generic machines, any value derived from them is universally applicable across industries, making it a one-size-fits-all solution.
Let's begin with the recruitment industry. Even before the rise of LLMs, I was deeply intrigued by the prospect of automating interview processes. When LLMs emerged, my curiosity was at its highest.
Could LLMs evaluate answers, ask follow-up questions, and even conduct interviews? The answer to all three questions was a resounding yes.
Now, how do we look at ‘value’ in this scenario? Consider a company with a workforce of around 500 individuals, where a subset of 20 to 30 employees is dedicated to conducting interviews throughout the year. The sheer amount of man-hours saved by deploying LLMs in this scenario is staggering, potentially amounting to thousands of hours.
This reduction in man-hours not only translates to cost savings but also frees up senior resources to focus on delivering value or generating revenue for the company. It's a win-win situation where both the company and its employees benefit from the efficiency brought about by LLMs.
Moreover, LLMs can reduce the biases that often creep into human-led interview processes. Whether it's ethnic, cultural, or linguistic bias, a consistently configured LLM applies the same evaluation criteria to every candidate, supporting a fairer assessment. This mitigation of bias adds another layer of value, helping ensure that hiring decisions rest on merit rather than prejudicial factors.
Moving on to the EdTech sector, LLMs hold immense promise in enhancing the learning experience for students. Putting it bluntly, EdTech companies want to retain students while simultaneously attracting new ones. One effective way to achieve this is by ensuring high levels of engagement among students.
Traditionally, incentivizing students through rewards has been a common approach. However, LLMs offer a more innovative solution now. By leveraging LLMs to evaluate students' performance in real time, EdTech platforms can provide personalized feedback tailored to each student's learning objectives. This immediate feedback loop not only enhances the learning experience but also fosters a deeper engagement with the material.
This is where the value addition comes into play. Let's say you're a third-grader who's a huge fan of Marvel comics and movies. Imagine this scenario: Iron Man zips from one place to another, covering a distance of X in a certain amount of time. Now, what's his speed?
It's not your typical textbook problem involving ducks or trains, or even a bicycle or a person walking; it's something the student genuinely relates to. That's the engagement factor that LLMs are introducing.
In essence, the integration of LLMs into EdTech platforms has the potential to revolutionize the way students learn and engage with educational content. As these innovations are implemented on a larger scale, we can anticipate significant shifts within the EdTech industry and beyond.
When it comes to industry value, LLMs hold the power to revolutionize entire sectors and the organizations within them. It's a race where being at the forefront matters most. Your marketing strategy should revolve around showcasing the benefits of AI — how it can enhance your child's learning or deliver specific value to your customers. Falling behind means being left in the dust.
Addressing bias, fairness, and ethical considerations
This is a crucial concern to address because, as they say, every bright side has its shadow. Vulnerability emerges as a major concern, encompassing both compliance and security risks. Compliance may seem like paperwork, but what truly matters are the risks we take on with other people's data.
“Often, in our willingness to take risks, we overlook the ethical considerations for our customers. These are aspects we must carefully weigh.”
Consider, for example, the healthcare and insurance sectors governed by regulations like HIPAA. Compliance with HIPAA entails adherence to PHI norms and ensuring the protection of sensitive information. Similarly, standards such as NIST, CMMC (for US defense contractors), GDPR (in the EU), PCI DSS (for card payments and finance companies), and SOX audits have become integral across various industries.
These compliance setups are essential not only to demonstrate ethical practices but also to safeguard sensitive information held by companies, protecting the interests of clients, customers, investors, and stakeholders. Failure to meet these standards can lead to fines and legal action.
Now, let's address LLMs. Sharing sensitive or proprietary data with an LLM poses a vulnerability and to be candid, it's a very significant one.
Data exists in two states: at rest and in transit. 'At rest' refers to data stored in databases, while 'in transit' pertains to data transmitted through systems such as APIs or emails. Vulnerabilities associated with LLMs predominantly arise during the 'in transit' phase, with fewer concerns related to 'at rest' data.
Companies typically employ a four-layered protection system, with the authorization layer granting access to authorized personnel. However, even this may not suffice for LLMs, especially if they're externally hosted, such as OpenAI. Despite authorization, using data inside an external LLM may not be permissible.
The second concern lies in preventing data abuse by others. Encrypting data both at rest and in transit renders it incomprehensible to an LLM: the model cannot decipher encrypted input, so it sees nothing but gibberish.
So, when it comes to handling sensitive data, we're left with a few alternatives: pseudonymization or anonymization, so let’s look at them in detail.
Anonymization and Pseudonymization:
Anonymization stands out as the primary approach, especially in scenarios involving medical data. We typically anonymize information like names, dates, and any PHI indicators before passing it to the LLM for analysis.
The LLM generates a medical summary based on this anonymized data, which we then fill in with the relevant details. However, anonymization isn't foolproof, and I'll illustrate that with a medical example.
Let's say we anonymize age data for a medical summary mentioning age-related conditions like arthritis or cataracts. Here's the issue: when the LLM summarizes the data, it may mistakenly assume the patient's age, overlooking the age-related context.
To address this, we've developed another method: pseudonymization. Instead of providing an exact age like 32, we might offer a range, such as 'in their 30s'. Alternatively, we can use pseudonyms like 'John Doe' to preserve the context that a person's name carries. Even though LLMs tend to be biased towards binary genders, pseudonymization helps maintain anonymity while retaining contextual relevance.
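Here's a minimal sketch of that idea in Python. The field names, the decade-range rule, and the fixed pseudonym are my own illustrative assumptions; a real pipeline would use a vetted de-identification tool and keep a secure, reversible mapping outside the LLM call:

```python
def pseudonymize(record):
    """Swap direct identifiers for context-preserving stand-ins.

    Illustrative only: the field names and rules here are assumptions,
    not a standard. Real systems use vetted PHI de-identification tooling.
    """
    decade = (record["age"] // 10) * 10
    return {
        "name": "John Doe",            # pseudonym keeps the "named person" context
        "age": f"in their {decade}s",  # a range keeps age-related clinical context
        "notes": record["notes"],      # free text would also need PHI scrubbing
    }

patient = {"name": "Maria Alvarez", "age": 32,
           "notes": "Reports joint stiffness; rule out early arthritis."}
print(pseudonymize(patient))
# The LLM now sees "John Doe, in their 30s", so an age-related condition
# like arthritis is still summarized with the right context.
```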
In the medical and financial domains, pseudonymization is a valuable tool. Sometimes, it works better than anonymization alone. Both approaches are integral to our dealings with LLMs and I want to show this with a personal example of how we innovatively dealt with an LLM.
In the field of EdTech, privacy concerns are paramount, especially when dealing with student data. To address this, we implemented a token system.
Instead of directly referring to students by name, we use tokens like '<St name>' to represent their identities. If you start with a generic "Hello, student," it won't make the cut either because students can tell when it's a machine addressing them.
The key is personalization — address them by their first name. So how do we do this with the token system? We replace the actual student's name with a token in a Python layer. This means the LLM can still generate responses tailored to the student, like mentioning their class teacher or any other relevant details.
The Python layer handles this personalization while keeping the student's identity anonymous for the LLM. Our system then maps these tokens to actual student details, ensuring a personalized experience without compromising privacy.
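A minimal sketch of that Python layer might look like this. The token format and field names are my own illustration of the approach described above, not the exact production code:

```python
TOKEN = "<St name>"

def to_llm(prompt_template, student):
    """Build the LLM prompt using the token, never the real name."""
    return prompt_template.replace("{student}", TOKEN)

def from_llm(llm_response, student):
    """Map the token back to the real first name before display."""
    return llm_response.replace(TOKEN, student["first_name"])

student = {"first_name": "Priya", "teacher": "Ms. Rao"}
prompt = to_llm("Give {student} feedback on today's fractions quiz.", student)
# The LLM only ever sees "<St name>", so the student's identity stays private.

simulated_llm_output = f"Great effort, {TOKEN}! Ask {student['teacher']} about question 3."
print(from_llm(simulated_llm_output, student))
# Prints: Great effort, Priya! Ask Ms. Rao about question 3.
```

Non-identifying context like the class teacher can still be passed through, which is what keeps the response personal even though the model never sees who it is addressing.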
These measures are essential in navigating data vulnerability. Ethical considerations are important; thus, we're compelled to devise ingenious solutions to circumvent potential risks.
One viable option is hosting our own LLM, though it entails significant investment. However, considering the value of our clients' data and the potential fines for non-compliance, owning our own LLM becomes a compelling last resort.
Are enterprises slow to adopt LLMs? - A technical perspective
This is one of the biggest questions people are asking themselves. While I want to provide a technical perspective on this thought process, I think it’s necessary to address the quicker answer first.
Many data scientists, while adept in their fields, find themselves narrowly focused on machine learning or deep learning techniques. Transitioning to Large Language Models (LLMs) presents a considerable paradigm shift.
It's challenging for in-house data scientists to swiftly adapt, as they're often preoccupied with refining existing models. This status quo underscores the importance of specialized LLM expertise.
Companies with dedicated LLM specialists are the ones that will capitalize on this emerging frontier, leaving others lagging.
Now, let’s get into the technical side of this.
Enterprises have to worry about determining the optimal approach. Should they opt for OpenAI's offerings or invest in hosting their own models? Calculating costs, estimating interaction volumes with LLMs, and discerning the cost-benefit ratio pose substantial challenges.
Moreover, questions about domain complexity, and about whether a generic LLM suffices or a domain-trained LLM is necessary, come up constantly.
Then comes the pivotal decision: Should enterprises deploy a model or develop an in-house LLM within their ecosystem, similar to Bloomberg's approach? While the allure of owning a bespoke LLM is strong, it's important to weigh the costs against potential benefits.
Rushing into decisions without due diligence can lead to significant financial investments in unnecessary fine-tuning, as illustrated by various hypothetical scenarios.
“Contrary to popular belief, enterprises are NOT oblivious to LLMs; rather, they’re dealing with translating technical considerations into actionable decisions.”
The challenge lies in articulating and comprehending the nuances of these decisions. Evaluating options, understanding the trade-offs, and setting realistic expectations for LLM capabilities is where the focus is.
Enterprises seek clarity — clear delineation of pros and cons, precise expectations, and a comprehensive understanding of the LLM landscape. They aspire for a unified solution that addresses multiple tasks without compromising efficacy. The way I see it, the process of decision-making regarding LLM adoption is deliberate and meticulous.
Undoubtedly, enterprises are diligently researching LLM capabilities and experimenting with platforms like OpenAI Cloud. However, the final decision hinges on a myriad of factors: data sensitivity, security measures, and the practicality of integrating LLMs into existing workflows.
These deliberations are complex and multifaceted, demanding careful consideration and thorough evaluation. Rushing into these decisions risks suboptimal outcomes which is far more detrimental for the organization.
How to look at Pretraining and Fine-tuning LLMs
I’ve already mentioned that if privacy is your main concern, building your own custom LLM is the safer option. If you want to go down this route, however, there are two fundamental aspects of model training you should understand and differentiate first: pretraining and fine-tuning.
Pretraining:
Pretraining serves as the genesis of LLMs. It involves training a model on vast datasets, often sourced from the internet. However, for many enterprises, pretraining isn't practical or feasible due to its resource-intensive nature.
Consider the monumental computational power required to train an LLM on even a fraction of internet data – a task that could incur costs in the millions. For instance, training an LLM on a subset representing one-tenth of the internet data would necessitate thousands of GPUs and could cost up to $10 million.
Consequently, not all organizations opt for pretraining, especially when high-quality pre-trained models, such as those offered by OpenAI and Anthropic, are readily accessible.
However, there are scenarios where pretraining becomes indispensable. Take, for instance, the construction sector. Imagine your company deals with building plans stored in text-based specification documents amassed over decades. Such proprietary data isn't readily available on the internet and demands a specialized approach.
Now, consider all this data you possess – it's not your typical internet data. This is proprietary information you've held onto for years. Your run-of-the-mill LLM won't cut it here. It won't comprehend or address your specific needs.
So, what's the solution? Use that data to train your own model tailored to your requirements. LLMs respond in general English. However, if you need them to speak in a construction-specific jargon or lingo, you'll have to do the legwork yourself.
In such cases, training a custom LLM using your organization's specific data becomes imperative. Unlike generic LLMs trained on internet data, a tailored model can provide insights pertinent to your industry's unique requirements.
The key takeaway is to discern the relevance of pretraining based on the nature and availability of your data. While pretraining holds significant promise for industries with specialized datasets, it may not be necessary for those leveraging open-source models.
Ultimately, the decision hinges on aligning your organization's needs with the capabilities and constraints of pretraining in the context of LLMs.
If your data requires a specialized approach, then training your own LLM is the way to go. Otherwise, there's no need; plenty of pre-trained models are available, crafted from subsets of internet data.
To simplify: pretraining isn't for everyone. It's for those with unique, non-internet-based datasets and specialized jargon.
Now, let's shift our focus to fine-tuning.
Fine-tuning:
Imagine the same scenario as pretraining but on a smaller scale. Instead of thousands of GPUs and millions in costs, you only need a fraction of that – a modest dataset and a handful of GPUs. For around $10,000, you can fine-tune your LLM to align with your domain expertise.
But here's the caveat: while fine-tuning, you must be meticulous about two key factors: the task and the model's behavior.
Fine-tuning isn't just about tweaking; it's about defining a specific task for the model to excel in. Many mistakenly equate it with simple Q&A, but it's far more nuanced. It's about imbuing your LLM with domain-specific knowledge, transforming it from a text completion tool into a domain-specific powerhouse.
To make this a reality, you must align your prompt precisely with the common use cases expected by the LLM or its users. Let's take an example from a project I'm overseeing.
Imagine a company with an e-commerce site seeking SEO-optimized content generation for its products. They not only want the content to be SEO-friendly but also to seamlessly integrate their product portfolio into the generated content. Here's where fine-tuning comes into play.
Fine-tuning is the ideal approach here for a couple of reasons. Firstly, since the goal is to produce SEO-optimized content consistently, fine-tuning allows you to define this task explicitly. The prompt is crafted to elicit SEO-friendly content generation each time, tailored to the specific needs of the company. This transforms the LLM into a specialized tool for SEO content creation, rather than a general-purpose language model.
Once fine-tuned, the LLM's primary function shifts. It's no longer about casual conversation or poetic output; instead, its focus narrows to generating SEO-optimized content on demand. Whenever prompted with a request for web content, the model responds with output tailored for SEO purposes, formatted, and ready for publication on the company's website.
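To make the fine-tuning step concrete, here's a hedged sketch of what such training data might look like, using the chat-style JSONL format common to several fine-tuning APIs. The products and copy are invented for illustration:

```python
import json

SYSTEM = "You write SEO-optimized product copy for our e-commerce catalog."

# Each example pairs a bare product prompt with the desired SEO output,
# so the task definition migrates from the prompt into the model's weights.
examples = [
    {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Product: trail running shoes. Keywords: lightweight, waterproof."},
        {"role": "assistant", "content": "<h1>Lightweight Waterproof Trail Running Shoes</h1><p>Built for wet terrain...</p>"},
    ]},
    {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Product: insulated bottle. Keywords: BPA-free, 24-hour cold."},
        {"role": "assistant", "content": "<h1>BPA-Free Insulated Bottle: Cold for 24 Hours</h1><p>Stainless steel...</p>"},
    ]},
]

with open("seo_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

print(f"wrote {len(examples)} training examples")
```

After fine-tuning on enough pairs like these, a plain product prompt is all the model needs; you stop paying for a long task description on every call.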
Fine-tuning is a concept well-understood in industries like image generation, but its nuances are often lost in the realm of text-based applications. Understanding its intricacies is crucial for maximizing the potential of language models in specific domains.
The missing link is education
The challenge we've recognized stems from a widespread lack of effective understanding regarding fine-tuning. Many individuals simply don't grasp how to fine-tune LLMs efficiently.
Picture this scenario: attempting to fine-tune a language model initially trained for SEO generation to instead perform summarization. Unfortunately, the outcomes often fall short, resulting in brief, inadequate summaries rather than the comprehensive content essential for SEO.
To tackle this challenge, meticulous curation of datasets and precise prompt engineering is needed. By fine-tuning the model with well-crafted prompts, consistent production of desired outputs becomes achievable, sparing the hassle of crafting lengthy prompts repeatedly. Over time, this method accumulates valuable saved tokens, crucial assets within the language model industry.
Educating stakeholders on the significance of fine-tuning is the most important next step. Although many acknowledge its importance, the practical implementation often eludes them.
Employing methods like Retrieval-Augmented Generation (RAG) enables LLMs to ingest contextual information beyond their initial training data, bridging the gap between general knowledge and domain-specific understanding.
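A toy sketch of the retrieval step, using naive word overlap where a real system would use vector embeddings and a vector store (the corpus and query are invented):

```python
import re

def score(query, doc):
    """Toy relevance score: count of shared words. Real RAG uses embeddings."""
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d)

def build_rag_prompt(query, corpus, k=2):
    """Retrieve the k most relevant snippets and prepend them as context."""
    top = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]
    context = "\n".join(f"- {doc}" for doc in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Spec 4.2: load-bearing walls require grade-60 rebar.",
    "Invoices are due within 30 days of delivery.",
    "Spec 7.1: rebar spacing must not exceed 12 inches.",
]
print(build_rag_prompt("What rebar grade do load-bearing walls require?", corpus))
# The two rebar specs become context; the irrelevant invoice note is left out.
```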
Fine-tuning efficiently imparts domain knowledge to LLMs with minimal data, yet its complexities demand a thorough understanding and strategic selection of approaches, such as alignment tuning and instruction tuning.
The transition from NLP awareness to practical implementation can be understandably daunting for many. Fine-tuning methods, whether self-supervised, supervised, or reinforcement-learning-based, each have distinct advantages and considerations. Furthermore, techniques like Low-Rank Adaptation (LoRA) offer innovative solutions, simplifying the process of loading massive models and facilitating effective fine-tuning.
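The arithmetic behind LoRA's appeal is worth seeing once: instead of updating a full weight matrix W of shape d x k, you train two small matrices B (d x r) and A (r x k) with rank r much smaller than d or k, and use W + BA at inference. The layer size below is illustrative, not any specific model's:

```python
# Illustrative transformer-layer dimensions (not from a specific model).
d, k, r = 4096, 4096, 8

full_params = d * k              # parameters updated by full fine-tuning
lora_params = d * r + r * k      # parameters updated by LoRA (B and A)

print(f"full fine-tune: {full_params:,} params for this matrix")
print(f"LoRA (r={r}):   {lora_params:,} params "
      f"({100 * lora_params / full_params:.2f}% of full)")
```

At rank 8 on a 4096 x 4096 matrix, LoRA trains well under 1% of the parameters that full fine-tuning would touch, which is what makes adapting massive models practical on modest hardware.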
Education on these nuances is essential for broader comprehension and informed decision-making among industry professionals and customers alike.
Predictions for the future
Yes, I know this is a long read already, but I’d like to finish this discussion with a few personal predictions for the future of LLMs. We’ll be done with this soon, I promise!
I had a hunch about Sora a year ago. Yes, OpenAI was gearing up to release the Sora model, which generates video from text prompts. I've been tracking OpenAI's trajectory since my gaming days. Remember when they put their bots into Dota 2 matches? They made waves by defeating top players and all-stars.
I knew then that their reinforcement capabilities were something to look out for.
Now, here's the thing: OpenAI is heading towards IoT and AI chips. Sam Altman's recent message about investing in IoT confirms this direction. They've been tying their LLMs to videos, audio, images, and text, building comprehensive multimodal models.
But there's a gap: real-time processing. That's where I believe the next big leap will occur — not just in language models but in large-scale models with real-time inference capabilities.
Imagine watching a football game and instantly getting detailed analysis as the action unfolds. That's how I see the future. But currently, the delay between video-to-text translation and data formatting means it takes too long. We're talking 15 to 20 minutes.
Real-time AI commentary is the golden egg everyone's chasing. Whoever cracks it will dominate the market for years to come. To stay ahead, this is where they need to invest their efforts. Don't overlook it. Keep your eyes peeled for this trend — it's going to revolutionize the industry.