Book Notes - LLMs in Production

LLMs in Production

A word of warning: embrace the future now. All new technology meets resistance and has critics; despite this, technologies keep being adopted, and progress continues. In business, technology can give a company an unprecedented advantage. There’s no shortage of stories of companies failing because they didn’t adapt to new technologies. We can learn a lot from their failures.

Like all other skills, your proximity and willingness to get involved are the two main blockers to knowledge, not a degree or the ability to notate; these only shorten your journey toward being heard and understood. If you don’t have any experience in this area, it might be good to start by first developing an intuition around what an LLM is and needs by contributing to a project like OpenAssistant. If you’re a human, that’s exactly what LLMs need. By volunteering, you can start understanding what these models train on and why. Wherever you fall, from no knowledge up to professional machine learning engineer, we’ll impart the knowledge necessary to shorten your time to understanding considerably.

Our current understanding of language is that language is made up of at least five parts: phonetics, syntax, semantics, pragmatics, and morphology. Each of these portions contributes significantly to the overall experience and meaning being ingested by the listener in any conversation.

  • Phonetics is probably the easiest for a language model to ingest, as it involves the actual sound of the language. This is where accent manifests. Phonetics deals with the production and perception of speech sounds, while phonology focuses on the way sounds are organized within a particular language system.
  • Syntax is where current LLMs perform best, both in parsing syntax from the user and in generating their own. Syntax is generally what we think of as grammar and word order; it is the study of how words combine to form phrases, clauses, and sentences. Syntax is also the first place language-learning programs start when helping people acquire new languages, especially based on the learner’s native language.
  • Semantics is the literal encoded meaning of words in utterances, and it changes at breakneck speed, in waves. People automatically optimize semantic meaning by only using words they consider meaningful in the current language epoch.

In the seminal paper, “Attention Is All You Need,” [1] Vaswani et al. take the mathematical shortcut several steps further, positing that for performance, absolutely no recurrence (the “R” in RNN) or any convolutions are needed at all.
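
To make that concrete, here is a minimal NumPy sketch of the scaled dot-product attention the paper builds on instead; the toy shapes and the self-attention example are illustrative, not from the book:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V -- no recurrence, no convolutions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # every token attends to every other token
    return softmax(scores) @ V       # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 tokens, 8-dimensional embeddings (toy sizes)
print(attention(x, x, x).shape)      # self-attention: Q = K = V = x -> (4, 8)
```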

Reasoning and Acting (ReAct) is a few-shot framework for prompting that is meant to emulate how people reason and make decisions when learning new tasks. [10] It involves a multistep process for the LLM, where a question is asked, the model determines an action, and then observes and reasons upon the results of that action to determine subsequent actions.
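
A runnable toy sketch of that loop, with a scripted model and a fake search tool standing in for a real LLM API and real tools (all names here are hypothetical):

```python
# The ReAct pattern: the model alternates Thought/Action steps, and the
# framework feeds tool results back in as Observations until the model
# commits to a final answer.
STEPS = iter([
    "Thought: I should look this up.\nAction: search[capital of France]",
    "Thought: The observation answers the question.\nFinal Answer: Paris",
])

def scripted_llm(transcript: str) -> str:
    return next(STEPS)  # a real version would send the transcript to a model

def search(query: str) -> str:
    return "Paris is the capital of France."  # stand-in for a real search tool

def react(question: str, llm, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:  # the model decided it is done
            return step.split("Final Answer:")[-1].strip()
        # Parse "Action: tool[input]", run the tool, append the observation.
        arg = step.split("Action:")[-1].strip().split("[", 1)[1].rstrip("]")
        transcript += f"Observation: {search(arg)}\n"
    return "no answer"

print(react("What is the capital of France?", scripted_llm))  # -> Paris
```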

PUSHING THE BOUNDARIES OF COMPRESSION: After going down to int4, there are experimental quantization strategies for going even further, down to int2. Int2 70B models still perform decently, much to many people’s surprise.
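
For intuition, a naive round-to-nearest quantizer shows how reconstruction error grows as the bit width shrinks; real int4/int2 schemes use per-group scales and error correction, so treat this as a toy sketch:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1      # 127 for int8, 7 for int4, 1 for int2
    scale = np.abs(w).max() / qmax  # one scale for the whole tensor (naive)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=10_000).astype(np.float32)
for bits in (8, 4, 2):
    q, s = quantize(w, bits)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"int{bits}: mean abs error {err:.4f}")  # error climbs as bits drop
```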

We can enhance this approach in an important way that we haven’t touched on yet: using knowledge graphs. Knowledge graphs store information in a structure that captures relationships between entities. This structure consists of nodes that represent objects and edges that represent relationships. A graph database like Neo4j makes it easy to create and query knowledge graphs. And as it turns out, knowledge graphs are amazing at answering more complex, multipart questions where you need to connect the dots between linked pieces of information, because they’ve already connected the dots for us.
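
As a sketch of what “connecting the dots” looks like in practice, here is a hypothetical multi-hop query through the official Neo4j Python driver; the URI, credentials, and graph schema are invented for illustration:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# "Which people work for companies that AcmeCorp acquired?" is a multipart
# question, but the graph has already connected the dots, so it reduces to a
# single two-hop traversal.
query = """
MATCH (p:Person)-[:WORKS_FOR]->(c:Company)<-[:ACQUIRED]-(:Company {name: $name})
RETURN p.name AS person, c.name AS company
"""

with driver.session() as session:
    for record in session.run(query, name="AcmeCorp"):
        print(record["person"], "->", record["company"])

driver.close()
```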


Links - Morgan Stanley uses AI evals to shape the future of financial services

Sharing this post on Morgan Stanley and AI from the OpenAI blog.

“This technology makes you as smart as the smartest person in the organization. Each client is different, and AI helps us cater to each client’s unique needs,” says Jeff McMillan, Head of Firmwide AI at Morgan Stanley.

Evals

To evaluate GPT-4’s performance against their experts, Morgan Stanley ran summarization evals to test how effectively the model condensed vast amounts of intellectual capital and process-driven content into concise summaries. Advisors and prompt engineers graded AI responses for accuracy and coherence, allowing the team to refine prompts and improve output quality.
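
As a rough sketch of what an automated cousin of that eval could look like (the article describes human graders; the judge model, rubric, and 1-5 scale below are my assumptions, not Morgan Stanley’s setup):

```python
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Grade the summary of the source document on a 1-5 scale for (a) accuracy "
    "(no claims unsupported by the document) and (b) coherence. "
    "Reply with exactly two integers: 'accuracy,coherence'."
)

def grade(document: str, summary: str) -> tuple[int, int]:
    """Ask an LLM judge to score one summary; returns (accuracy, coherence)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Document:\n{document}\n\nSummary:\n{summary}"},
        ],
    )
    accuracy, coherence = resp.choices[0].message.content.split(",")
    return int(accuracy), int(coherence)
```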

The biggest takeaway

The eval framework wasn’t static; it evolved as the team learned.

This should be expected with generative AI: your eval framework should not be static, and your project is never done.

Expanding the corpus

This is massive and extremely well done.

“We went from being able to answer 7,000 questions to a place where we can now effectively answer any question from a corpus of 100,000 documents,” says David Wu, Head of Firmwide AI Product & Architecture Strategy at Morgan Stanley.

Adoption

Over 98% adoption in wealth management.

!! !! !!

Their strong eval framework has also unlocked a flywheel for future solutions and services.

I am currently reading The Value Flywheel Effect, so this really resonated with me.

They’ve tackled the two hardest obstacles, in my opinion: they’ve created a very successful project AND gotten widespread adoption. When stakeholders are bought in like this, releasing additional projects should have a much lower barrier to entry.


Links - Lucky Maverick | The Secret to Success - Mimic Evolution

Sharing this post from Jonathan Bales on Evolution.

most decisions really don’t matter.

What to Learn from Evolution

You can be dumb and still find huge success, as long as you:

  • Perfect the trial-and-error cycle

The faster you can make the trial-and-error cycle work, the quicker you can find success.

The best way to speed up your learning curve: be extreme.

  • Don’t suppress chaos

Less Data

You don’t always need more data. Usually, you need less.

The more data we have, the more likely we are to drown in it. - Taleb

This quote is so true: we wait to make a decision when we have way too much information. We could have made an educated decision months ago, but we continually delay to reduce our risk of failure. If you don’t take chances, you’ll never fail, but you’ll also never learn to overcome adversity.

Removing Fragility

Avoid risk of ruin

You should take on all kinds of risk when there’s nothing to lose.

Admit how little you know

Overconfidence is fragility.


Amazon Nova foundation models

Amazon announced their Amazon Nova foundation models on December 3, 2024. The main features are low cost and low latency.

Amazon Nova

Micro

  • very low cost
  • text-only
  • great for summarization, translation, and classification
  • 128k token context

Lite

  • Multimodal
    • Multiple images
    • Up to 30 minutes of video
  • 300k token context
  • Fine-tuning with model distillation is available

Pro

  • Similar to Lite but with more accuracy
  • Serves as a teacher model to distill custom variants of Lite and Micro

Premier

  • Availability in early 2025

Simon Willison compares them to the Google Gemini family of models.

Costs for Micro are $0.035 per million input tokens and $0.14 per million output tokens; Lite is $0.06 and $0.24, and Pro is $0.80 and $3.20. The Micro model is slightly cheaper than Gemini 1.5 Flash-8B and appears to be the cheapest model available.
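
A quick back-of-the-envelope check of those prices:

```python
PRICES = {  # dollars per million (input, output) tokens, from the post above
    "micro": (0.035, 0.14),
    "lite": (0.06, 0.24),
    "pro": (0.80, 3.20),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Filling Micro's entire 128k context and getting a 1k-token reply:
print(f"${cost('micro', 128_000, 1_000):.4f}")  # -> $0.0046
```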


Links - No Priors Ep. 91 | With Cohere Co-Founder and CEO Aidan Gomez

This interview with Cohere CEO Aidan Gomez is a must-watch (or must-listen) for anyone interested in AI. Here are some key takeaways:

  • The role of “luck and chance” in his journey.
  • The smallest nuance can significantly impact language model outputs, highlighting the need for robust and reliable models (Cohere is trying to build models with this in mind).
  • Tailoring AI solutions to specific needs, like generating medically relevant summaries for a doctor based on a simple phrase like “My knee hurts,” vs. requiring them to look through 20 years of patient notes.
  • Listening to customers is crucial for identifying valuable applications. We are not even close to having all the answers with LLMs.
  • Begin with the most cost-effective solutions and gradually build complexity. You shouldn’t be starting from scratch in most use cases.
  • Waiting 6-12 months can dramatically reduce development costs (specifically referring to building an LLM). Staying on the bleeding edge may be profitable, but waiting 6-12 months and taking advantage of everything learned is more profitable.


Links - Seth Godin Severe weather alert

Like a lot of Seth Godin posts, this one is short but impactful: Severe Weather Alert.

He discusses getting an alert every day about severe weather in his area, but now it’s just “weather.”

We think that regularly alerting people to something is likely to get their attention again and again.

The more an application cries wolf, the more likely we are to ignore it.


Amazon Bedrock Updates for November 2024

Reducing hallucinations in large language models with custom intervention using Amazon Bedrock Agents

  • Amazon Bedrock Guardrails offers hallucination detection and grounding checks (existing functionality)
  • You can develop a custom hallucination score using RAGAS to reject generated responses (requires SNS); a rough sketch of the scoring idea follows this list
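
A rough sketch of that idea, using the classic ragas evaluate() API; exact imports and result access vary by ragas version, and the 0.9 threshold is an arbitrary choice of mine, not from the AWS post:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

def accept(question: str, answer: str, contexts: list[str], threshold: float = 0.9) -> bool:
    """Score one answer for faithfulness to its retrieved contexts; reject below threshold."""
    ds = Dataset.from_dict(
        {"question": [question], "answer": [answer], "contexts": [contexts]}
    )
    score = evaluate(ds, metrics=[faithfulness])["faithfulness"]
    return score >= threshold  # a rejected answer would trigger the custom intervention
```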

Amazon Bedrock Flows is now generally available

  • Prompt Flows has been renamed to Amazon Bedrock Flows (Microsoft also uses the name Prompt Flows)
  • You can now filter harmful content using the Prompt node and Knowledge Base node
  • Improved traceability: you can now quickly debug workflows with traceability of inputs and outputs

Prompt Optimization on Amazon Bedrock

  • This is exactly what it sounds like: you provide a prompt, and it is optimized for use with a specific model, which can result in significant improvements for Gen AI tasks; a hedged sketch of the call appears below
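
A hedged sketch of what the call might look like from boto3; I believe the operation is optimize_prompt on the bedrock-agent-runtime client with a streamed response, but treat the exact request and response shapes here as assumptions:

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.optimize_prompt(
    input={"textPrompt": {"text": "Summarize this support ticket: {{ticket}}"}},
    targetModelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example target model
)

# The optimized prompt streams back as events (event key names assumed).
for event in response["optimizedPrompt"]:
    if "optimizedPromptEvent" in event:
        print(event["optimizedPromptEvent"])
```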


Sharing - RIP to RPA - The Rise of Intelligent Automation

Notes from the Andreessen Horowitz article “RIP to RPA: The Rise of Intelligent Automation.”

Traditionally, robotic process automation (RPA) was a hard-coded “bot” that mimicked the exact keystrokes necessary to complete a task. With LLMs, however, the original vision of RPA is now possible. An AI agent can be prompted with an end goal, e.g., book a flight from DSM to ORD on these dates, and will have the correct agents available to complete the task.
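
A toy sketch of that goal-in, steps-out pattern; the tool names and the hard-wired plan (which stands in for an LLM planner) are hypothetical:

```python
def search_flights(origin: str, dest: str, dates: tuple) -> list[dict]:
    return [{"flight": "XX123", "price": 240}]  # stand-in for a real flights API

def book(flight: dict) -> str:
    return f"booked {flight['flight']}"  # stand-in for a real booking API

def agent(goal: str) -> str:
    """Given an end goal, pick and sequence the right tool calls.

    A real agent would ask an LLM to plan from the goal text; the plan is
    hard-wired here so the sketch stays runnable.
    """
    flights = search_flights("DSM", "ORD", ("2025-01-10", "2025-01-12"))
    cheapest = min(flights, key=lambda f: f["price"])
    return book(cheapest)

print(agent("book a flight from DSM to ORD on these dates"))  # -> booked XX123
```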

There is a large opportunity for startups in this space, because no existing product meets the original vision of RPA. There are two main areas:

horizontal AI enablers that execute a specific function for a broad range of industries, and vertical automation solutions that build end-to-end workflows tailored to specific industries.


Future of Business - Palo Alto Networks’ Nikesh Arora on Managing Risk in the Age of AI

I really enjoyed this podcast with Nikesh Arora, CEO of Palo Alto Networks, where he discussed how much of their strategy is tied to acquisition vs. trying to build everything in-house. He had some other insights that probably aren’t revolutionary, but I appreciated his openness in this interview.

Here are my key takeaways:

The AI Revolution and Cybersecurity

  • With practically everything being internet-connected, the potential points of vulnerability for cyberattacks are enormous
  • Bad actors are increasingly using AI to infiltrate systems, which requires companies like Palo Alto Networks to use AI to counter their attacks
  • He sees AI as a productivity tool that will augment human work, taking over repetitive tasks and allowing employees to focus on more enjoyable tasks

Acquisition and Integration

  • Palo Alto is acquiring innovative cybersecurity companies to stay ahead of threats
  • He stresses the importance of empowering the acquired teams and providing them with resources

Concerns and Risks

  • He discusses the importance of a zero trust security model, treating every user and device with the same level of scrutiny
  • They talk about the (obvious) potential for GenAI to be used maliciously
  • Arora anticipates regulations focusing on transparency, guardrails, and control of critical processes
  • He strongly emphasizes the importance of collaboration between industry and regulators.


Book Notes - Software Engineering at Google

Software Engineering at Google

  • Programming is certainly a significant part of software engineering: after all, programming is how you generate new software in the first place. If you accept this distinction, it also becomes clear that we might need to delineate between programming tasks (development) and software engineering tasks (development, modification, maintenance).

  • This distinction is at the core of what we call sustainability for software. Your project is sustainable if, for the expected life span of your software, you are capable of reacting to whatever valuable change comes along, for either technical or business reasons.
  • Importantly, we are looking only for capability: you might choose not to perform a given upgrade, either for lack of value or other priorities. When you are fundamentally incapable of reacting to a change in underlying technology or product direction, you’re placing a high-risk bet on the hope that such a change never becomes critical.

  • Team organization, project composition, and the policies and practices of a software project all dominate this aspect of software engineering complexity. These problems are inherent to scale: as the organization grows and its projects expand, does it become more efficient at producing software?

  • In 2012, we tried to put a stop to this with rules mitigating churn: infrastructure teams must do the work to move their internal users to new versions themselves or do the update in place, in backward-compatible fashion. This policy, which we’ve called the “Churn Rule,” scales better: dependent projects are no longer spending progressively greater effort just to keep up. We’ve also learned that having a dedicated group of experts execute the change scales better than asking for more maintenance effort from every user: experts spend some time learning the whole problem in depth and then apply that expertise to every subproblem. Forcing users to respond to churn means that every affected team does a worse job ramping up, solves their immediate problem, and then throws away that now useless knowledge. Expertise scales better.

  • The more frequently you change your infrastructure, the easier it becomes to do so.

  • We have found that most of the time, when code is updated as part of something like a compiler upgrade, it becomes less brittle and easier to upgrade in the future. In an ecosystem in which most code has gone through several upgrades, it stops depending on the nuances of the underlying implementation; instead, it depends on the actual abstraction guaranteed by the language or OS. Regardless of what exactly you are upgrading, expect the first upgrade for a codebase to be significantly more expensive than later upgrades, even controlling for other factors.

  • We believe strongly in data informing decisions, but we recognize that the data will change over time, and new data may present itself. This means, inherently, that decisions will need to be revisited from time to time over the life span of the system in question. For long-lived projects, it’s often critical to have the ability to change directions after an initial decision is made. And, importantly, it means that the deciders need to have the right to admit mistakes. Contrary to some people’s instincts, leaders who admit mistakes are more respected, not less.

  • Programming is the immediate act of producing code. Software engineering is the set of policies, practices, and tools that are necessary to make that code useful for as long as it needs to be used and allowing collaboration across a team.

  • “Software engineering” differs from “programming” in dimensionality: programming is about producing code. Software engineering extends that to include the maintenance of that code for its useful life span.

  • Software is sustainable when, for the expected life span of the code, we are capable of responding to changes in dependencies, technology, or product requirements. We may choose to not change things, but we need to be capable.

  • Being data driven is a good start, but in reality, most decisions are based on a mix of data, assumption, precedent, and argument. It’s best when objective data makes up the majority of those inputs, but it can rarely be all of them.

  • Software development is a team endeavor. And to succeed on an engineering team, or in any other creative collaboration, you need to reorganize your behaviors around the core principles of humility, respect, and trust.

  • It turns out that this Genius Myth is just another manifestation of our insecurity.

  • Many programmers are afraid to share work they’ve only just started because it means peers will see their mistakes and know the author of the code is not a genius.

  • The current DevOps philosophy toward tech productivity is explicit about these sorts of goals: get feedback as early as possible, test as early as possible and think about security and production environments as early as possible. This is all bundled into the idea of “shifting left” in the developer workflow; the earlier we find a problem, the cheaper it is to fix it.

  • A good postmortem should include the following:
  1. A brief summary of the event
  2. A timeline of the event, from discovery through investigation to resolution
  3. The primary cause of the event
  4. Impact and damage assessment
  5. A set of action items (with owners) to fix the problem immediately
  6. A set of action items to prevent the event from happening again
  7. Lessons learned
  • Admitting that you’ve made a mistake or that you’re simply out of your league can increase your status over the long run. In fact, the willingness to express vulnerability is an outward show of humility; it demonstrates accountability and the willingness to take responsibility, and it’s a signal that you trust others’ opinions. In return, people end up respecting your honesty and strength. Sometimes, the best thing you can do is just say, “I don’t know.”
