Beginning
A few months ago, while working on the Databricks with R workshop, I came across some of their custom SQL functions. These particular functions are prefixed with “ai_”, and they run NLP with a simple SQL call:
> SELECT ai_analyze_sentiment('I am happy');
positive
> SELECT ai_analyze_sentiment('I am sad');
negative
It was a revelation to me. It introduced a new way to use LLMs in our daily work as analysts. So far, I had used LLMs primarily for code completion and development tasks, but this approach instead focuses on using LLMs directly against our data.
My first reaction was to try to access these functions via R. With dbplyr we have access to SQL functions from R, and it was great to see them work:
orders |>
  mutate(
    sentiment = ai_analyze_sentiment(o_comment)
  )
#> # Source: SQL (6 x 2)
#> o_comment sentiment
#> <chr> <chr>
#> 1 ", pending theodolites … neutral
#> 2 "uriously special foxes … neutral
#> 3 "sleep. courts after the … neutral
#> 4 "ess foxes may sleep … neutral
#> 5 "ts wake blithely unusual … mixed
#> 6 "hins sleep. fluffily … neutral
One downside of this integration is that, even though the functions are accessible through R, we need a live connection to Databricks to use an LLM in this way, which limits the number of people who can benefit from it.
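For reference, the dbplyr example above assumes a live connection along these lines. This is only a rough sketch: it assumes the databricks() helper from recent versions of the odbc package, credentials supplied through environment variables, and the TPC-H sample data that Databricks ships; the warehouse path is a placeholder, not a real one.
library(DBI)
library(dplyr)
library(dbplyr)

# Open a live connection to a Databricks SQL warehouse
# (host and token are read from environment variables; httpPath is a placeholder)
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/..."
)

# Lazily reference the TPC-H sample table that produced the output above
orders <- tbl(con, in_catalog("samples", "tpch", "orders"))

# The ai_* call is translated to SQL and runs inside Databricks
orders |>
  mutate(sentiment = ai_analyze_sentiment(o_comment))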
According to their documentation, Databricks uses the Llama 3.1 70B model. While it is a highly effective large language model, its enormous size poses a significant challenge for most users’ computers, making it impractical to run on standard hardware.
Achieving viability
LLM development has been accelerating at a rapid pace. Initially, only online large language models (LLMs) were viable for everyday use, which raised concerns among companies hesitant to share their data externally. Moreover, the cost of using an online LLM can be significant; the per-token fees add up quickly.
The ideal solution would be to integrate the LLM into our own systems, which requires three basic components:
- A model that fits comfortably in memory
- A model that achieves sufficient accuracy for NLP tasks
- An intuitive interface between the model and the user’s laptop
Until last year, it was almost impossible to have all three of these elements. Models that fit in memory were either inaccurate or too slow. However, recent advances, such as Meta’s Llama models and cross-platform interaction engines such as Ollama, have made it feasible to deploy these models, offering a promising solution for companies looking to integrate LLMs into their workflows.
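As a concrete illustration of what “deployment” involves, the snippet below pulls a model with the Ollama command-line tool from an R session. It assumes Ollama is already installed locally; the model name matches the one used later in this post.
# Assumes the Ollama CLI is installed locally (https://ollama.com)
# Download the Llama 3.2 weights once; after that, everything runs offline
system2("ollama", c("pull", "llama3.2"))

# List the models now available to the local Ollama service
system2("ollama", "list")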
Project
This project began as an exploration, driven by my interest in using a “universal” LLM to achieve results comparable to those from Databricks’ AI functions. The primary task was to determine how much setup and preparation would be required for such a model to provide reliable and consistent results.
Without access to the design document or source code, I relied solely on the LLM’s output as my testing ground. This presented several obstacles, including the many options available for fine-tuning the model. Even within prompt engineering alone, the possibilities are vast. To ensure that the model was not too specialized or focused on a particular subject or outcome, I needed to strike a delicate balance between accuracy and generality.
Fortunately, after extensive testing, I found that a simple “one-shot” prompt produced the best results. By “best” I mean that the answers were accurate for a given row and consistent across multiple rows. Consistency meant that the model always returned one of the specified options (positive, negative, or neutral) without any further explanation.
The following is an example of a prompt that worked reliably against Llama 3.2:
>>> You are a helpful sentiment engine. Return only one of the
... following answers: positive, negative, neutral. No capitalization.
... No explanations. The answer is based on the following text:
... I am happy
positive
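To show how such a one-shot prompt can be sent programmatically, here is a rough sketch of calling a local Ollama instance from R over its REST API. It assumes Ollama is running on its default port (11434), that the llama3.2 model has been pulled, and that the httr2 package is installed; it illustrates the general idea, not how mall itself is implemented.
library(httr2)

# The same one-shot prompt shown above, collapsed into a single string
prompt <- paste(
  "You are a helpful sentiment engine. Return only one of the",
  "following answers: positive, negative, neutral. No capitalization.",
  "No explanations. The answer is based on the following text:",
  "I am happy"
)

# Send a single, non-streaming generate request to the local Ollama service
resp <- request("http://localhost:11434/api/generate") |>
  req_body_json(list(model = "llama3.2", prompt = prompt, stream = FALSE)) |>
  req_perform() |>
  resp_body_json()

resp$response  # expected to contain just "positive", per the prompt above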
As a side note, my attempts to send multiple rows at once failed. I spent a fair amount of time exploring different approaches, such as sending 10 or 2 rows at a time and formatting them as JSON or CSV. The results were often inconsistent, and batching didn’t seem to speed up the process enough to make it worth the effort.
Once I was comfortable with this approach, the next step was to package the functionality in an R package.
Access
One of my goals was to make the mall package as “ergonomic” as possible. In other words, I wanted to ensure that using the package in R and Python would integrate seamlessly with how data analysts use their preferred language on a daily basis.
For R, it was relatively simple. I just needed to verify that the functions work well with pipes (%>% and |>), and that they can easily be incorporated into workflows that use packages such as those in the tidyverse:
reviews |>
  llm_sentiment(review) |>
  filter(.sentiment == "positive") |>
  select(review)
#> review
#> 1 This has been the best TV I've ever used. Great screen, and sound.
For Python, however, not being a native language for me meant that I had to adapt how I think about data manipulation. Specifically, I learned that in Python, objects (like pandas DataFrames) “contain” their transformation functions by design.
This insight led me to investigate whether the pandas API allowed extensions, and luckily it did! After exploring the options, I decided to start with Polars, which allowed me to extend its API by creating a new namespace. This simple add-on gave users easy access to the essential functions:
>>> import polars as pl
>>> import mall
>>> df = pl.DataFrame(dict(x = ("I am happy", "I am sad")))
>>> df.llm.sentiment("x")
shape: (2, 2)
┌────────────┬───────────┐
│ x ┆ sentiment │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪═══════════╡
│ I am happy ┆ positive │
│ I am sad ┆ negative │
└────────────┴───────────┘
Keeping all the new features within the llm namespace makes it very easy for users to find and use the ones they need.
What’s next?
I think it will be easier to know what is coming for mall once the community uses it and provides feedback. I expect the main request will be to add more LLM back-ends. Another likely change will come as newer models become available, since the prompts for a given model may need to be updated. I experienced this when switching from Llama 3.1 to Llama 3.2: one of the prompts needed tweaking. The package is structured so that future enhancements like this will be additions, not replacements of existing prompts, in order to maintain backwards compatibility.
This is my first time writing an article about the history and structure of a project. This particular effort was so unique, because of its R + Python and LLM aspects, that I figured it was worth sharing.
If you want to know more about mall, feel free to visit its official site: https://mlverse.github.io/mall/