What does the perfect LLM product dev tool look like?
What is the perfect product for managing, maintaining, and improving LLM-powered products? While dozens of tools to help build LLM-powered products are being launched, it's unclear which of them — if any — will be around in a few months.
PhaseLLM began as a wrapper for popular language models. We still support this, enabling users to plug in or swap out versions of GPT, Claude, Llama, and more. As we work on the next iteration of our product, we've created a list of themes and features we think are critical for any product meant to help people build LLM-powered products.
There's a lot here... And if you are reading this, there's a good chance we reached out to you to get your feedback on these priorities. If you have comments, ideas, additions, or other feedback, please reach out to Wojciech at w [at] phaseai [dot] com
A quick note on the categories below.
We don't provide much narrative or prose here, just a list of features and an explanation of what we mean by each. This is meant to be a "to the point" list. It is also presented in descending order of priority, which is another thing we are validating via market and customer discovery interviews.
Advanced Prompting
- Chaining: build chains of messages based on responses, data received by the LLM, or other conditions that are met or not met. ChainForge is a fantastic example of how sophisticated (and beautiful) products like this can be.
- Routing: related to chaining is routing. Chaining typically follows a linear path, whereas routing adds logic to chains so that an LLM might do different things or follow different prompts depending on the conversation (see the sketch after this list).
- Prompt Optimization: improve prompts based on observed data. While numerous tools purportedly improve prompts, we have yet to see anything that does so in a verifiable, scientific manner.
- Data Access: support for Retrieval Augmented Generation (RAG) or other approaches to pulling external data and including it in prompts.
- Function Calling: fine-tuning or prompting models to respond with function calls or structured data (a structured-output sketch also follows this list).
- Prompt Templating: provide support for variables, templates, and other ways to turn prompts into more dynamic components. Most prompting tools already support this.
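To make the chaining, routing, and templating ideas above concrete, here is a minimal sketch in Python. The complete() helper is hypothetical, a stand-in for whatever provider wrapper you use, and the templates and routing rule are illustrative, not a prescribed design.

```python
def complete(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's completion API."""
    raise NotImplementedError("plug in GPT, Claude, Llama, etc. here")

# Templating: prompts as templates with variables.
CLASSIFY_TEMPLATE = "Classify the sentiment of this message as POSITIVE or NEGATIVE:\n{message}"
APOLOGY_TEMPLATE = "Write a short, empathetic apology responding to:\n{message}"
THANKS_TEMPLATE = "Write a short thank-you note responding to:\n{message}"

def handle_message(message: str) -> str:
    # Step 1 of the chain: classify the incoming message.
    label = complete(CLASSIFY_TEMPLATE.format(message=message)).strip().upper()

    # Routing: branch to a different prompt depending on the LLM's answer.
    if "NEGATIVE" in label:
        next_prompt = APOLOGY_TEMPLATE.format(message=message)
    else:
        next_prompt = THANKS_TEMPLATE.format(message=message)

    # Step 2 of the chain: generate the final response.
    return complete(next_prompt)
```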
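Function calling can also be approximated with plain prompting plus parsing when a provider's native function-calling API isn't available. A minimal sketch, again assuming the same hypothetical complete() helper:

```python
import json

def complete(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's completion API."""
    raise NotImplementedError("plug in your provider here")

EXTRACT_TEMPLATE = (
    "Extract the customer's name and the product they mention from the text "
    "below. Respond with ONLY a JSON object with keys 'name' and 'product'.\n\n"
    "Text: {text}"
)

def extract_entities(text: str) -> dict:
    """Push the model toward structured output, then parse it."""
    raw = complete(EXTRACT_TEMPLATE.format(text=text))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Models occasionally wrap JSON in prose; a production version
        # would retry or repair the output here.
        raise ValueError(f"Model did not return valid JSON: {raw!r}")
```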
Evaluation
- Manual Evaluation and Scoring: score chats or responses across one or more variables. Ideally this includes beautiful visualizations that help compare models and clearly show which is better or worse.
- Automated Evaluation: enable automated scoring via third-party LLMs or models.
- Experiments: run formal, statistically sound experiments. This includes logging hypotheses and tracking statistical significance (see the sketch after this list).
- Rerun Chats: be able to rerun chats to see if different prompt strategies or other conditions would yield different results.
- Arena-Like Competitions: think blind reviews where you choose the best response from a set of LLM responses. FastChat is a good example of a project that enables chat comparisons like this.
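As a concrete example of what a statistically sound experiment could look like, here is a minimal sketch comparing two prompt variants with a two-sample t-test via SciPy. The scores are illustrative placeholders; in practice they would come from manual or automated evaluation.

```python
# Compare quality scores for two prompt variants and test whether the
# observed difference is statistically significant.
from scipy import stats

scores_prompt_a = [4, 5, 3, 4, 4, 5, 3, 4]  # e.g., 1-5 quality ratings
scores_prompt_b = [3, 3, 4, 2, 3, 4, 3, 2]

t_stat, p_value = stats.ttest_ind(scores_prompt_a, scores_prompt_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The difference between prompts is statistically significant.")
else:
    print("No significant difference; collect more data or revise prompts.")
```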
Logging, Testing, and Observability
- Review Chats: provide support for logging chats so non-technical staff can review them. This is the key feature behind PhaseLLM Evals.
- Regression Testing: automatically track and rerun chats when models are updated or other conditions change. Think of this as regression testing for LLMs.
- Tracking Cost, Tokens, Latency: track costs, latency, token usage, and other variables by user, by model, and by API endpoint (a minimal tracker sketch follows this list).
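For illustration, a minimal sketch of per-call usage tracking in plain Python. The flat per-1K-token rate and field names are assumptions for the example, not any particular provider's pricing or schema; most provider APIs do return token counts in their responses.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    user: str
    model: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float

@dataclass
class UsageTracker:
    records: list = field(default_factory=list)

    def log(self, user, model, latency_s, prompt_tokens, completion_tokens,
            usd_per_1k_tokens):
        # Illustrative flat rate; real pricing varies by model and token type.
        cost = (prompt_tokens + completion_tokens) / 1000 * usd_per_1k_tokens
        self.records.append(CallRecord(user, model, latency_s,
                                       prompt_tokens, completion_tokens, cost))

    def total_cost(self, model=None):
        return sum(r.cost_usd for r in self.records
                   if model is None or r.model == model)

tracker = UsageTracker()
start = time.monotonic()
# ... make your LLM call here and read token counts from the response ...
tracker.log("user-123", "gpt-4", time.monotonic() - start, 850, 120, 0.03)
print(f"Total spend: ${tracker.total_cost():.4f}")
```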
API Access
- Convert Chains to API Endpoints: if you're already building and optimizing chains directly, you might as well enable serving those chains as endpoints (see the sketch after this list).
- API Proxy: ... and if you are providing endpoints like this, you might as well provide support to be the API proxy for all LLM calls.
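For illustration, a minimal sketch of serving a chain as an endpoint using FastAPI. The handle_message chain and the route name are hypothetical stand-ins for whatever chain you have built.

```python
from fastapi import FastAPI
from pydantic import BaseModel

def handle_message(message: str) -> str:
    """Stand-in for a real chain, like the one sketched under Advanced Prompting."""
    return f"(chain output for: {message})"

app = FastAPI()

class ChainRequest(BaseModel):
    message: str

@app.post("/chains/support-triage")
def run_chain(req: ChainRequest):
    # Serve the chain behind a single endpoint.
    return {"response": handle_message(req.message)}

# Run with: uvicorn main:app --reload
```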
Collaboration
- Team Collaboration: enable teams to work together on prompts, with version control and edit tracking (a minimal versioning sketch follows this list).
- Data Sharing: building data sets to track prompt effectiveness is laborious. If you build such a data set, you might want to share it. Alternatively, you might want to use third-party data sets to test your prompts or responses. Yann LeCun has discussed building an open source, Wikipedia-style repository for such data. This could be a great product feature.
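As a rough idea of what prompt version control might look like, here is a minimal sketch: an append-only version history with authors and timestamps, so edits can be reviewed and rolled back. All names here are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptVersion:
    text: str
    author: str
    created_at: datetime

@dataclass
class VersionedPrompt:
    name: str
    versions: list = field(default_factory=list)

    def edit(self, text: str, author: str) -> None:
        # Append-only history: every edit becomes an immutable version.
        self.versions.append(
            PromptVersion(text, author, datetime.now(timezone.utc)))

    @property
    def current(self) -> str:
        return self.versions[-1].text

prompt = VersionedPrompt("support-triage")
prompt.edit("Classify the sentiment of: {message}", author="wojciech")
prompt.edit("Classify the sentiment (POSITIVE/NEGATIVE) of: {message}",
            author="teammate")
print(len(prompt.versions), prompt.current)
```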
Other Considerations
These considerations aren't product features, but they come up a great deal in discussions around LLM-powered product management and developer tooling.
- Marketplaces: OpenAI's strategy around plugins and now GPTs implies there is an opportunity to become an LLM 'aggregator', where you provide a centralized repository of apps, agents, prompts, or other offerings. This could be incredibly useful for those building LLM-powered products, though no one seems to have cracked this nut yet.
- Open Source: it's important to note that aside from the foundation models themselves, most products in this space have been open source and/or use open core business models. This is likely not a coincidence, as open source makes it easier for people to trust, and engage with, the brand or company. The challenge, then, becomes monetization.
- Clone OpenAI, but with Llama 2: the only company really innovating on user-facing features like chats, APIs, GPTs, plugins, and function calling is OpenAI. One approach to building a lasting brand in this space would be to build a similar type of offering, but via an open source model like Llama 2.
So what?
In short: PhaseLLM (and our parent company, Phase AI) is working on building at least parts of the above. If we missed anything, if you have feedback, or want to collaborate, please reach out at w [at] phaseai [dot] com