LLM Tracing Implementation to Analyze and Visualize LLM at Scale

Published Apr 16, 2025

Large Language Models (LLMs) are transforming various industries by enabling advanced text generation, question-answering, and reasoning capabilities. However, their internal workings remain largely opaque, making it challenging to optimize performance, ensure reliability, and mitigate risks like bias or hallucinations.

Implementing a robust tracing system is crucial for gaining insights into LLM execution, tracking data flow, and optimizing model performance at scale. This article explores how to implement LLM tracing using open-source solutions, enabling efficient analysis and visualization of LLM operations.

Significance of Tracing

APIs, backend services, and domain-specific applications operate remotely, making it challenging to understand their execution and operational flow. Since these services interact over distributed systems, tracking every interaction, execution step, and data exchange is essential. This is where events and traces come into play, as they provide deep insights into these interactions, helping teams diagnose issues, optimize performance, and ensure reliability.

Telemetry serves as a powerful process to capture these events and traces at a granular level, enabling in-depth analysis. Given the rapid evolution of AI, staying ahead of changes is crucial to maintaining system efficiency and security. As the demand for robust AI solutions grows, innovative and open-source tools are emerging to address complex business challenges.

Key Considerations for Remote Service Observability

Operational Complexity: Remote services often involve multiple components communicating asynchronously, making it difficult to track execution flow.
Event and Trace Analysis: Capturing traces helps map out how data moves between services and where potential bottlenecks or failures occur.
Role of Telemetry: By collecting and analyzing telemetry data, organizations can proactively monitor system health and optimize AI-driven applications.
Open-Source Solutions: Many open-source tools now enable seamless observability, making it easier to implement LLM tracing at scale.

With these principles in mind, let's implement an approach to capture LLM flow traces using open-source solutions.

Solution Overview

Unlike the standard approach, where we use an existing LLM solution like Gemini or ChatGPT over an API and integrate it with other services, we will leverage an open-source LLM such as Llama 3 from Hugging Face to generate next-word predictions and capture telemetry metrics from every service along the flow.

OpenTelemetry metrics are a crucial source of information for businesses to understand LLM inference and performance. Arize-Phoenix is an open-source solution with extensive functionality that captures detailed metrics about the overall LLM inference and interaction. Using these metrics and traces, teams can seamlessly experiment, evaluate, and optimize enterprise-scale large language models in real time.

1. Initial Setup

Before we pull the model from Hugging Face and initialize the pipeline, let us start our Arize-Phoenix tracing client and the vector database. We will pull Docker images from Docker Hub and run them instead of altering local configurations and settings.

#start qdrant client
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

#start arize-phoenix client
docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest

With our vector database and Phoenix client running in the background, let us install the necessary Python packages to implement the LLM workflow. Qdrant is exposed on port 6333 and Arize-Phoenix on port 6006. We can also establish a gRPC connection with Arize-Phoenix over port 4317, which we will leverage in this implementation.

pip install -q transformers torch qdrant-client \
    opentelemetry-sdk opentelemetry-api arize-phoenix matplotlib
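
Before moving on, it can help to confirm that both containers are reachable. Below is a quick sanity check, assuming the default ports from the docker commands above; on a fresh Qdrant instance the collection listing will simply be empty.

from qdrant_client import QdrantClient

# Confirm the Qdrant container responds on its default port.
qdrant_check = QdrantClient("localhost", port=6333)
print(qdrant_check.get_collections())

# The Phoenix UI should now be reachable in a browser at http://localhost:6006,
# and its OTLP gRPC endpoint listens on localhost:4317.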

2. Implementation

Let's now dive into the model setup. For the sake of simplicity, we will download the 1-billion-parameter Llama 3.2 model from Hugging Face.

import transformers
import torch

hf_model_id = "meta-llama/Llama-3.2-1B"
pipeline = transformers.pipeline(
    "text-generation",
    model=hf_model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

The Transformers library offers an extensive set of functionalities for working with and fine-tuning large language models in native Python. With the Transformers pipeline wrapper, we can easily initialize and set up an LLM for a next-word generation task.
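
As a quick smoke test of the pipeline, we can generate a short continuation; the prompt and token limit below are arbitrary choices.

# Hypothetical prompt; max_new_tokens keeps the smoke test fast.
sample = pipeline("The quick brown fox", max_new_tokens=20)
print(sample[0]["generated_text"])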

3. Text Tokenization

Humans perceive textual information, process it, and act accordingly, but machines are different: they process information as tensors of floating-point numbers. So here we convert text into tokens, embed them as vectors, and store or process them based on the use case.

tokenizer = transformers.AutoTokenizer.from_pretrained(hf_model_id)
input_prompt = "generate python code to read from a list in loop"
tokens = tokenizer(input_prompt, return_tensors="pt")

Here, we use the tokenizer from the same pre-trained model to tokenize the text and return the tokens as PyTorch tensors. With these tokens, we generate embeddings and either ingest them or perform similarity searches against existing vectors in the database.
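
Before running a similarity search, the target collection has to exist in Qdrant. The sketch below creates it if missing, sizing the vectors to the model's hidden dimension; the collection name opentelemetry_collection and the cosine distance metric are assumptions for this walkthrough.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

setup_client = QdrantClient("localhost", port=6333)
embedding_dim = pipeline.model.config.hidden_size  # 2048 for Llama-3.2-1B

# Create the collection once; skip if it is already there.
existing = [c.name for c in setup_client.get_collections().collections]
if "opentelemetry_collection" not in existing:
    setup_client.create_collection(
        collection_name="opentelemetry_collection",
        vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE),
    )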

from qdrant_client import QdrantClient

qdrant_client = QdrantClient("localhost", port=6333)

# Embed the token ids with the model's input embedding layer, then mean-pool them.
embedding_layer = pipeline.model.get_input_embeddings()
query_vector = (embedding_layer(tokens.input_ids.to(pipeline.model.device))
                .mean(dim=1).squeeze().detach().to(torch.float32).cpu().numpy())

search_results = qdrant_client.search(collection_name="opentelemetry_collection",
                                      query_vector=query_vector, limit=3)

As we can see in the above code snippet, we established connectivity with Qdrant over port 6333 and used our model pipeline to obtain the embeddings. We then query the database for similar embeddings to use them as context for predicting the next word.
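
The same client can also ingest new vectors so future searches have context to draw from. Here is a minimal upsert sketch, where the point id and payload fields are illustrative choices:

from qdrant_client.models import PointStruct

# Store the pooled prompt embedding along with the raw prompt as payload.
qdrant_client.upsert(
    collection_name="opentelemetry_collection",
    points=[PointStruct(id=1, vector=query_vector.tolist(),
                        payload={"prompt": input_prompt})],
)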

4. Telemetry Setup

Now we enter the interesting part of this implementation: setting up and aggregating the telemetry information. The most fundamental concept in OpenTelemetry and Arize-Phoenix is the span. A span is a unit of work or operation taking place in a remote/distributed environment. Spans are the building blocks of traces and enable monitoring. Using spans, ML/AI teams can easily analyze application performance and state at scale.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource


trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "llama-inference"}))
)
tracer = trace.get_tracer(__name__)

We have set up an OpenTelemetry tracer to capture all the traces that our model and inference actions generate, so they can be refined and collected for analysis and visualization.

from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

console_exporter = ConsoleSpanExporter()
span_processor = SimpleSpanProcessor(console_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

Additionally, we can set up the console exporter for debugging purposes in the development stage. Also, setting this at the debug level can help SRE teams understand where and why an interruption or degradation in the inference flow occurred.
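
Since Phoenix also listens for OTLP traffic on the gRPC port 4317 we exposed earlier, spans can be shipped to it with the standard OTLP exporter as well. Here is a minimal sketch, assuming the opentelemetry-exporter-otlp package is installed:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Batch spans and send them to Phoenix over gRPC.
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))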

from phoenix.trace.otel import OpenInferenceSpanExporter

phoenix_exporter = OpenInferenceSpanExporter()
phoenix_span_processor = SimpleSpanProcessor(phoenix_exporter)
trace.get_tracer_provider().add_span_processor(phoenix_span_processor)

With an OpenTelemetry tracer in place, we can now instantiate the span exporter from the Arize-Phoenix library. This class allows us to export the spans from every interaction with our transformers pipeline.

from phoenix.trace import OpenInferenceTracer

tracer = OpenInferenceTracer()
with tracer.span(name="llama_inference") as span:
    output = pipeline(input_prompt)
    span.set_attribute("input_prompt", input_prompt)
    span.set_attribute("output", output)

Phoenix also offers an OpenInference tracer, with which we can wrap every prompt fed to the model via the pipeline and record the inputs and outputs as span attributes.

As we can see in the snippet above, traces from every inference action are aggregated and can be persisted in a vector or traditional database for analysis. For real-time visualization, we can do something like the following.
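
If the Phoenix tracer API differs in your installed version, the same information can be captured with the plain OpenTelemetry tracer configured earlier. Below is a rough sketch that also records latency as an attribute; the attribute names are our own.

import time
from opentelemetry import trace

otel_tracer = trace.get_tracer(__name__)

with otel_tracer.start_as_current_span("llama_inference") as span:
    start = time.perf_counter()
    output = pipeline(input_prompt, max_new_tokens=64)
    span.set_attribute("llm.input", input_prompt)
    span.set_attribute("llm.output", output[0]["generated_text"])
    span.set_attribute("llm.latency_seconds", time.perf_counter() - start)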

5. Visualizing the Traces

The Arize-Phoenix utility function allows us to retrieve the traces and loop over them to unpack the information. As shown below, we can call get_traces and iterate over the resulting spans to extract the necessary information. For simplicity, let's capture the inference latency with respect to time.

from phoenix.trace.utils import get_traces
import matplotlib.pyplot as plt

traces = get_traces()
latencies = [span.latency for span in traces]
timestamps = [span.start_time for span in traces]

plt.figure(figsize=(10, 6))
plt.plot(timestamps, latencies, marker='o', linestyle='-', color='b')
plt.title('Inference Latency Over Time')
plt.xlabel('Timestamp')
plt.ylabel('Latency (seconds)')
plt.grid(True)
plt.show()

The metrics accumulated in the latencies and timestamps variables can be plotted like this to gain performance insights and drive robust optimizations.
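
Beyond the plot, a couple of summary statistics over the same latencies are often enough to spot regressions; here is a small illustrative addition using only the standard library:

import statistics

# Quick aggregate view of the collected inference latencies.
print("median latency (s):", statistics.median(latencies))
print("p95 latency (s):", statistics.quantiles(latencies, n=20)[-1])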

Conclusion

As LLMs continue to evolve and power enterprise-scale applications, the need for robust observability and performance optimization has never been greater. Implementing tracing mechanisms like OpenTelemetry and Arize-Phoenix provides deep visibility into LLM workflows, enabling teams to monitor inference processes, detect bottlenecks, and refine models for better efficiency.

By leveraging open-source solutions, organizations can seamlessly track interactions, measure latency, and analyze system behavior in real time. This ensures greater transparency, faster troubleshooting, and more reliable AI-driven applications. As LLM adoption accelerates, integrating advanced tracing methodologies will be crucial for scaling AI operations while maintaining accuracy, efficiency, and trust.
