
✨ Overview

This article explores how Gemini tokenizes data and demonstrates how to count or estimate tokens locally. You'll learn how to use the local tokenizer to estimate text token counts offline, understand the tokenization math for multimodal inputs (images, audio, video, PDFs), and see how to retrieve precise token usage metadata from API responses for accurate tracking and billing.

ℹ️ The complete source code is available in this notebook (including all setup details and future updates) under the Apache 2.0 license. You can also directly open the notebook in Colab. This article reproduces all the results generated by a click on “Run all”.

⚙️ Setup

🐍 Google Gen AI Python SDK

To call the Gemini API, we'll use the Google Gen AI Python SDK. The Gemini API provides a `count_tokens` method, and the SDK offers an experimental implementation of a `LocalTokenizer` class. Make sure you have a recent version of the `google-genai` package with its `local-tokenizer` extra:

```bash
pip install --quiet "google-genai[local-tokenizer]>=2.9.0"
```

🛠️ Google Cloud Project

To get started using the Gemini API on Agent Platform, you must have an existing Google Cloud project and enable the Agent Platform API. Learn more about setting up a project and a development environment.

```python
import os

PROJECT_ID = ""
LOCATION = "global"

if not PROJECT_ID:
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
assert PROJECT_ID, "❌ Please set the PROJECT_ID variable"
if not LOCATION:
    LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "global")
```

🤖 Gen AI SDK Client

To interact with the Gemini API, we initialize a `genai.Client`.
Since we're using the enterprise-ready Agent Platform backend (formerly Vertex AI), we pass `enterprise=True` along with our Google Cloud `project` and `location`:

```python
from google import genai


def print_configuration(client: genai.Client) -> None:
    service = "Agent Platform" if client.vertexai else "Google AI"
    print(f"ℹ️ Using the {service} API", end="")
    if client._api_client.project:
        print(f' with project "{client._api_client.project[:7]}…"', end="")
        print(f' in location "{client._api_client.location}"')
    elif client._api_client.api_key:
        api_key = client._api_client.api_key
        print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
        print(f" (in case of error, make sure it was created for {service})")


client = genai.Client(enterprise=True, project=PROJECT_ID, location=LOCATION)
print_configuration(client)
```

```text
ℹ️ Using the Agent Platform API with project "lpdemo-…" in location "global"
```

🧠 Gemini Model

We'll use `gemini-3.1-flash-lite` as our default model for token counting and content generation. This lightweight, fast model is ideal for high-throughput tasks.

```python
MODEL_ID = "gemini-3.1-flash-lite"
```

🧩 The Basics: Tokens and Tokenizers

Tokens

Large language models (LLMs) don't process our inputs directly, nor do they generate the final text or media we see. Instead, they operate on fundamental units called tokens, ingesting them as inputs and generating them as outputs.

Here's what happens when we send an LLM request:

1. Our inputs are transformed into tokens. In other words, they are tokenized.
2. The model generates output tokens, which represent the most likely next tokens based on the overall context.
3. These output tokens are transformed back into the final content we can use.

You can think of a token as a piece of information, and this tokenization process acts as an information compression codec:

- Encoding: Input → Input tokens
- Decoding: Output tokens → Output

Tokenization is necessary to compress information to the right level of semantic granularity, allowing the model's attention mechanism to focus and develop an understanding of the provided data.

Tokenizers

Gemini is natively multimodal and accepts text, images, audio, video, and PDFs. These media types can be processed by a set of three tokenizers:

| Input | Text Tokenizer | Image Tokenizer | Audio Tokenizer | Comment |
|:---|:---:|:---:|:---:|----|
| Text | ✅ | | | The original tokenizer type, when LLMs were only chatbots. |
| Image | | ✅ | | An image ~~is~~ can be worth a thousand ~~words~~ tokens! |
| Audio | ✅ | | ✅ | Text tokens are used for timestamps (`MM:SS` or `H:MM:SS`). |
| Video | ✅ | ✅ | [✅] | By default, one frame is sampled per second, along with its corresponding timestamp. Audio is optional for videos. |
| PDF | ✅ | ✅ | | PDFs are processed by vision tokenizers. Text tokens are used for OCR and pagination data. |

As you can see, up to three tokenizers can be involved, depending on the modality.

💡 Keep in mind that not all underlying tokens are necessarily billed. See the `usage_metadata` section below for examples of tokens actually billed per modality.

Vocabulary

The complete set of unique tokens that an LLM can ingest or generate makes up its vocabulary. Once an LLM is trained, its vocabulary is fixed and is used for inference.

A vocabulary is essentially a lookup table mapping text sequences to token IDs (which correspond to vector representations in a semantic space). This means tokenizers are simply algorithms that use this vocabulary to encode and decode tokens (i.e., to convert data to and from token IDs).
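
To make the lookup-table idea concrete, here is a minimal sketch of a toy vocabulary with `encode` and `decode` helpers. The token IDs are the ones reported for Gemini in the table just below; everything else (the names `VOCAB`, `ID_TO_PIECE`, `encode`, `decode`) is hypothetical. The sketch takes the split of a word into pieces as given: deciding where to split is the part a real tokenizer learns statistically, and it is not modeled here.

```python
# Toy vocabulary: a handful of entries, with IDs taken from the article's table.
# A real vocabulary holds far more entries.
VOCAB: dict[str, int] = {
    "hello": 23391,
    "pass": 4373,
    "passion": 208039,
    "ionate": 84242,
    "né": 8504,
    "ately": 2295,
    "ionalmente": 134916,
}

# Reverse lookup table, used for decoding.
ID_TO_PIECE: dict[int, str] = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(pieces: list[str]) -> list[int]:
    """Map already-segmented text pieces to their token IDs."""
    return [VOCAB[piece] for piece in pieces]


def decode(token_ids: list[int]) -> str:
    """Map token IDs back to text by concatenating their pieces."""
    return "".join(ID_TO_PIECE[token_id] for token_id in token_ids)


ids = encode(["pass", "ionate"])
print(ids)          # [4373, 84242]
print(decode(ids))  # passionate
```

Encoding and decoding are plain dictionary lookups in opposite directions, which is why a fixed vocabulary is all a tokenizer needs at inference time.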
For example, the Gemini text tokenizers process common words like this:

| Text | Tokens | Tokenization | Token IDs |
|----|:---:|----|----|
| hello | 1 | A single token for most common sequences | 23391 |
| passion | 1 | passion | 208039 |
| passionate | 2 | pass • ionate | 4373 • 84242 |
| passionné | 2 | passion • né (passionate in French) | 208039 • 8504 |
| passionately | 2 | passion • ately | 208039 • 2295 |
| passionalmente | 2 | pass • ionalmente (passionately in Italian) | 4373 • 134916 |

💡 As you can see, words with the same root aren't necessarily split the same way. Text tokenizers have no concept of syllables, prefixes, or suffixes. They don't think like linguists or grammarians; they think like statisticians and look for statistically optimal combinations.

🌐 Baseline: API Token Counting

The Gemini API lets you count tokens for any multimodal input by sending a `count_tokens` request. While you need to be authenticated to use it, this method is free of charge, so you can audit your prompts before committing to a paid request.

Likewise, the `compute_tokens` method lets you retrieve the list of corresponding tokens and token IDs.
Let's reproduce the previous table:

```python
from collections.abc import Iterator

import IPython.display
from google.genai.types import (
    ComputeTokensResponse,
    ComputeTokensResult,
    CountTokensResponse,
    CountTokensResult,
)

RowData = tuple[str, str, str, str]


def display_token_info_from_api(model: str, texts: list[str]) -> None:
    def yield_data() -> Iterator[RowData]:
        for text in texts:
            count_result = client.models.count_tokens(model=model, contents=text)
            compute_result = client.models.compute_tokens(model=model, contents=text)
            yield get_text_token_info(text, count_result, compute_result)

    display_token_info(yield_data())


def display_token_info(yield_data: Iterator[RowData]) -> None:
    def yield_row() -> Iterator[RowData]:
        yield "Text", "Tokens", "Tokenization", "Token IDs"
        yield "-", ":-:", "-", "-"
        yield from yield_data

    markdown = "\n".join("| " + " | ".join(row) + " |" for row in yield_row())
    IPython.display.display(IPython.display.Markdown(markdown))


def get_text_token_info(
    text: str,
    count_tokens_res: CountTokensResponse | CountTokensResult,
    compute_tokens_res: ComputeTokensResponse | ComputeTokensResult,
) -> RowData:
    def inline_code(s: str) -> str:
        return f"`{s}`"

    total_tokens = count_tokens_res.total_tokens
    tokens_info = compute_tokens_res.tokens_info
    assert tokens_info is not None and len(tokens_info) == 1
    info = tokens_info[0]
    assert info.tokens is not None and info.token_ids is not None

    tokenization = " • ".join(t.decode("utf-8", errors="replace") for t in info.tokens)
    token_ids = " • ".join(str(token_id) for token_id in info.token_ids)

    return (
        inline_code(text),
        str(total_tokens),
        inline_code(tokenization),
        inline_code(token_ids),
    )


TEXTS = [
    "hello",
    "passion",
    "passionate",
    "passionné",
    "passionately",
    "passionalmente",
]

display_token_info_from_api(MODEL_ID, TEXTS)
```

| Text | Tokens | Tokenization | Token IDs |
|----|:---:|----|----|
| hello | 1 | hello | 23391 |
| passion | 1 | passion | 208039 |
| passionate | 2 | pass • ionate | 4373 • 84242 |
| passionné | 2 | passion • né | 208039 • 8504 |
| passionately | 2 | passion • ately | 208039 • 2295 |
| passionalmente | 2 | pass • ionalmente | 4373 • 134916 |

🚀 Why Count Tokens Locally?

Here are a few use cases where counting (or just estimating) tokens locally is useful:

- Offline & Speed: You can count tokens completely offline. Plus, even when you're online, doing it locally means you don't have to wait for a network round-trip to the Gemini API just to check your prompt size.
- Quotas: While the `count_tokens` method is free, counting locally saves bandwidth and prevents you from hitting API rate limits, especially during high-volume token counting.
- Latency: You can estimate how much time is needed to process your text input before you start receiving a response (for a given model, the time-to-first-token latency is roughly proportional to the number of input tokens).
- Cost Control: You can estimate and budget your API costs before committing to a paid request.
- Routing: Knowing which token-count bucket your input falls into lets you route requests to different models based on speed, cost, or context size.
- Privacy: You can audit the token count of sensitive data without sending it over your network.

🔤 Using the Local Text Tokenizer

Create a local tokenizer for the specific Gemini model you're using:

```python
from google.genai.local_tokenizer import LocalTokenizer

tokenizer = LocalTokenizer(model_name=MODEL_ID)
```

💡 Remarks

- Creating a tokenizer takes a few seconds, during which the configuration and vocabulary are loaded into memory.
- On the first call, the tokenizer data is downloaded and stored in a local cache. This step requires an internet connection and about 30 MB of storage.
- If you want to build a fully offline solution, you can check out the SDK source code and persist the tokenizer assets (e.g., by configuring a persistent cache directory or building a container image).

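
Once a local token count is available, the cost-control and routing use cases listed above reduce to simple arithmetic. The following is a minimal sketch under stated assumptions: the tier names, token thresholds, and prices are all made up for illustration and do not correspond to real models or real pricing. In practice, `input_tokens` could come from the `total_tokens` value returned by the local tokenizer's `count_tokens` method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str               # hypothetical tier label, not a real model ID
    max_input_tokens: int   # hypothetical upper bound of this bucket
    usd_per_million: float  # hypothetical input price, for illustration only


# Buckets ordered from smallest to largest (all values are assumptions).
ROUTES = [
    Route("small-context-tier", 8_000, 0.10),
    Route("medium-context-tier", 128_000, 0.25),
    Route("large-context-tier", 1_000_000, 0.50),
]


def pick_route(input_tokens: int) -> Route:
    """Return the first tier whose bucket can hold the input."""
    for route in ROUTES:
        if input_tokens <= route.max_input_tokens:
            return route
    raise ValueError(f"{input_tokens:,} tokens exceed every configured tier")


def estimate_input_cost(input_tokens: int, route: Route) -> float:
    """Estimated input cost in USD for the chosen tier."""
    return input_tokens / 1_000_000 * route.usd_per_million


input_tokens = 50_000  # e.g., a count obtained locally before any API call
route = pick_route(input_tokens)
print(route.name, f"${estimate_input_cost(input_tokens, route):.4f}")
```

Keeping the thresholds in a plain ordered list makes the routing policy easy to audit and adjust without touching the selection logic.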
Checking the internal tokenizer name confirms that the Gemma open-weight models share the same text tokenizer as the Gemini 3 family:

```python
print(f'Text tokenizer name for "{MODEL_ID}": "{tokenizer._tokenizer_name}"')
```

```text
Text tokenizer name for "gemini-3.1-flash-lite": "gemma4"
```

Call the `count_tokens()` method on a small text input:

```python
contents = "Hello World!"
result = tokenizer.count_tokens(contents)
print(f"{result.total_tokens=}")
```

```text
result.total_tokens=3
```

Now, let's reproduce the previous API tokenization tests with our local tokenizer:

```python
def display_token_info_from_local_tokenizer(tokenizer: LocalTokenizer, texts: list[str]) -> None:
    def yield_data() -> Iterator[RowData]:
        for text in texts:
            count_result = tokenizer.count_tokens(contents=text)
            compute_result = tokenizer.compute_tokens(contents=text)
            yield get_text_token_info(text, count_result, compute_result)

    display_token_info(yield_data())


display_token_info_from_local_tokenizer(tokenizer, TEXTS)
```

| Text | Tokens | Tokenization | Token IDs |
|----|:---:|----|----|
| hello | 1 | hello | 23391 |
| passion | 1 | passion | 208039 |
| passionate | 2 | pass • ionate | 4373 • 84242 |
| passionné | 2 | passion • né | 208039 • 8504 |
| passionately | 2 | passion • ately | 208039 • 2295 |
| passionalmente | 2 | pass • ionalmente | 4373 • 134916 |

💡 As expected, we get exactly the same results, but with 100% local execution this time.

Finally, let's download a longer text, like Hamlet:

```python
import requests


def get_text_from_url(content_url: str, force_encoding: str = "") -> str:
    response = requests.get(content_url, timeout=10)
    response.raise_for_status()
    if force_encoding:
        # Use for HTTP headers with unknown/incorrect charset
        response.encoding = force_encoding
    return response.text


TEXT_URL = "https://storage.googleapis.com/dataflow-samples/shakespeare/hamlet.txt"
contents = get_text_from_url(TEXT_URL)
print(contents[:256] + "[…]")
```

```text
HAMLET

DRAMATIS PERSONAE

CLAUDIUS king of Denmark. (KING CLAUDIUS:)
HAMLET son to the late, and nephew to the present king.
POLONIUS lord chamberlain. (LORD POLONIUS:)
HORATIO friend to Hamlet.
LAERTES son to Polonius.
LUCIANUS nephew to the kin[…]
```

How many tokens do we need to encode Hamlet?

```python
result = tokenizer.count_tokens(contents)
print(f"{result.total_tokens=:,}")
```

```text
result.total_tokens=54,660
```

💡 Hamlet gets broken down locally into 50k+ tokens in a fraction of a second. If you tokenize War and Peace, you'll get 850k+ tokens.

🕵️‍♂️ Accounting for "Hidden" Tokens

When you send a request to Gemini, the total input token count isn't always just the sum of your input data.

To keep things simple, we tested text token counts with default parameters. The `count_tokens` and `compute_tokens` methods both have a `config` parameter. Depending on your request configuration, your inputs and outputs may include additional tokens.

Keep an eye out for these hidden additions:

- System Instructions: Any system prompt you set will add to the total token count.
- Thinking: If thinking is enabled, an internal chain of thought can generate additional thinking tokens.
- Tools and Functions: If you provide a list of tools (like Python execution or custom functions), their declarations, calls, and responses are part of your prompt payload.
- Response Schema: Enforcing structured outputs (like JSON) requires the model to process the schema definition you provide, which consumes input tokens.
- Chat History: In multi-turn conversations, the entire chat history is sent back to the model with every new message, meaning your input token count grows with each turn.

🧮 Multimodal Token Math

Multimodal inputs (images, audio, video, and documents) aren't tokenized like text. They usually have specific calculation rules based on the model (and its underlying tokenizers), the media type, and the request configuration.
For multimodal inputs, refer to the documentation for details on how token counts are calculated for different media types:

- Image understanding
- Audio understanding
- Video understanding
- Document understanding

There are generally multiple tokenization options, even for a single modality. You can use the `count_tokens` method and the calculation rules to estimate the token count of your own payloads.

To get a clearer picture, let's look at actual requests and see how token counts are broken down by modality…

🎯 Tracking Actual Token Usage

While estimating token counts is super useful, you should always rely on the `usage_metadata` returned in the API response when you need to track your actual usage down to the exact token. It's the single source of truth for billing.

Here's the gist of how `usage_metadata` lets you get the token counts by modality:

```python
class GenerateContentResponse:
    # …
    usage_metadata: Optional[GenerateContentResponseUsageMetadata]
    # …


class GenerateContentResponseUsageMetadata:
    # …
    prompt_token_count: Optional[int]
    prompt_tokens_details: Optional[list[ModalityTokenCount]]
    # …


class ModalityTokenCount:
    modality: Optional[MediaModality]
    token_count: Optional[int]


class MediaModality(StrEnum):
    MODALITY_UNSPECIFIED = "MODALITY_UNSPECIFIED"
    TEXT = "TEXT"
    IMAGE = "IMAGE"
    VIDEO = "VIDEO"
    AUDIO = "AUDIO"
    DOCUMENT = "DOCUMENT"
```

🐍 Let's define a few helpers:

```python
from google.genai.types import (
    FileData,
    GenerateContentResponse,
    MediaModality,
    Part,
    PartMediaResolution,
    PartMediaResolutionLevel,
    VideoMetadata,
)

TokensPerModality = dict[MediaModality, int]


def display_tokens_per_modality(response: GenerateContentResponse) -> None:
    usage_metadata = response.usage_metadata
    if not usage_metadata:
        print("⚠️ No usage metadata found in the response.")
        return

    prompt_tokens_details = usage_metadata.prompt_tokens_details or []
    tokens_per_modality = get_empty_tokens_per_modality()
    for tokens_details in prompt_tokens_details:
        modality = tokens_details.modality
        if modality and modality in tokens_per_modality:
            tokens_per_modality[modality] += tokens_details.token_count or 0

    prompt_token_count = usage_metadata.prompt_token_count or 0
    display_token_table(tokens_per_modality, prompt_token_count)


def get_empty_tokens_per_modality() -> TokensPerModality:
    return {
        modality: 0
        for modality in MediaModality
        if modality != MediaModality.MODALITY_UNSPECIFIED
    }


def display_token_table(
    tokens_per_modality: TokensPerModality,
    total_tokens: int,
) -> None:
    def yield_row() -> Iterator[list[str]]:
        yield [mod.value for mod in tokens_per_modality.keys()] + ["Total"]
        yield [":-:" for _ in range(len(tokens_per_modality) + 1)]
        yield [f"{t:,d}" for t in tokens_per_modality.values()] + [f"{total_tokens:,d}"]

    markdown = "\n".join("| " + " | ".join(row) + " |" for row in yield_row())
    IPython.display.display(IPython.display.Markdown(markdown))
```

Let's check a few examples…

🖼️ Image Tokenization

Image token counts depend on the image itself and the configured media resolution:

```python
class PartMediaResolutionLevel(StrEnum):
    MEDIA_RESOLUTION_UNSPECIFIED = "MEDIA_RESOLUTION_UNSPECIFIED"
    MEDIA_RESOLUTION_LOW = "MEDIA_RESOLUTION_LOW"
    MEDIA_RESOLUTION_MEDIUM = "MEDIA_RESOLUTION_MEDIUM"
    MEDIA_RESOLUTION_HIGH = "MEDIA_RESOLUTION_HIGH"
    MEDIA_RESOLUTION_ULTRA_HIGH = "MEDIA_RESOLUTION_ULTRA_HIGH"
```

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per image:

| media_resolution | Tokens |
|----|---:|
| `MEDIA_RESOLUTION_LOW` | 280 |
| `MEDIA_RESOLUTION_MEDIUM` | 560 |
| `MEDIA_RESOLUTION_HIGH` (default) | 1,120 |
| `MEDIA_RESOLUTION_ULTRA_HIGH` | 2,240 |

🐍 Check how this cat image is tokenized by default:

```python
def display_tokens_for_image(
    image_uri: str,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {media_resolution_level=}")
    contents = Part.from_uri(
        file_uri=image_uri,
        mime_type="image/*",
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


IMAGE_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/image/chair-cat.png"

display_tokens_for_image(IMAGE_URI)
```

```text
🧪 media_resolution_level=None
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 1,080 | 0 | 0 | 0 | 1,080 |

💡 This image is tokenized into only 1,080 tokens (instead of the maximum 1,120), saving us 40 tokens! It's a nice touch that helps keep costs down rather than defaulting to the upper limit.

🐍 For less detailed images, you can reduce token counts by a factor of 2 or 4 using the medium or low levels:

```python
display_tokens_for_image(IMAGE_URI, PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW)
```

```text
🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 264 | 0 | 0 | 0 | 264 |

💡 At the other end of the media resolution range, the ultra-high level is great for detailed images (like a photo of a circuit board with many components), ensuring maximum visual understanding. An image at this level uses between 2,000 and 2,240 tokens.

🔊 Audio Tokenization

Audio tokenization currently uses 25 tokens per second to represent the audio stream semantically.
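
With a fixed rate, an audio token estimate is a one-line calculation. The sketch below assumes the 25 tokens per second quoted above and assumes that fractional results are rounded up, which is consistent with the measurements reported in this article. The names `AUDIO_TOKENS_PER_SECOND` and `estimate_audio_tokens` are hypothetical helpers, not part of the SDK.

```python
import math

# Rate quoted in this article; it may differ for other models or future versions.
AUDIO_TOKENS_PER_SECOND = 25


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate audio tokens for a clip, rounding up (assumed behavior)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds * AUDIO_TOKENS_PER_SECOND)


print(estimate_audio_tokens(60))  # 1500
```

Such an estimate is only a planning aid; the token counts returned in `usage_metadata` remain the reference for actual usage.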
🐍 Here is the tokenization for a 3.049-second audio file:

```python
def display_tokens_for_audio(audio_uri: str) -> None:
    contents = Part.from_uri(file_uri=audio_uri, mime_type="audio/*")
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


AUDIO_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/hello_gemini_are_you_there.wav"

display_tokens_for_audio(AUDIO_URI)
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 77 | 0 | 77 |

💡 ceil(3.049 s × 25 tok/s) = ceil(76.225 tok) = 77 tok

🐍 A longer, 30.772-second audio file requires 10 times as many tokens, as expected:

```python
AUDIO_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/sailor_audio.mp3"

display_tokens_for_audio(AUDIO_URI)
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 770 | 0 | 770 |

💡 ceil(30.772 s × 25 tok/s) = ceil(769.3 tok) = 770 tok

🎬 Video Tokenization

For videos:

- The audio tokenizer is the same as for standalone audio (25 tokens per second).
- Video frames are sampled (1 FPS by default) and tokenized based on the media resolution.

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per sampled frame:

| media_resolution | Max. tokens |
|----|---:|
| `MEDIA_RESOLUTION_LOW` / `MEDIA_RESOLUTION_MEDIUM` (default) | 70 |
| `MEDIA_RESOLUTION_HIGH` | 280 |

🐍 Here's the tokenization for a 59-second video:

```python
def display_tokens_for_video(
    video_uri: str,
    fps: float | None = None,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {fps=}, {media_resolution_level=}")
    contents = Part(
        file_data=FileData(file_uri=video_uri, mime_type="video/*"),
        video_metadata=VideoMetadata(fps=fps) if fps is not None else None,
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


VIDEO_URI = "https://www.youtube.com/watch?v=0pJn3g8dfwk"

display_tokens_for_video(VIDEO_URI)
```

```text
🧪 fps=None, media_resolution_level=None
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 3,894 | 1,475 | 0 | 5,369 |

💡 Details

- Video: ceil(59 s × 1 frame/s × 66 tok/frame) = ceil(3894 tok) = 3894 tok
- Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

🐍 Doubling the sampling rate requires twice as many video tokens:

```python
display_tokens_for_video(VIDEO_URI, fps=2)
```

```text
🧪 fps=2, media_resolution_level=None
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 7,788 | 1,475 | 0 | 9,263 |

💡 Details

- Video: ceil(59 s × 2 frame/s × 66 tok/frame) = ceil(7788 tok) = 7788 tok
- Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

🐍 If you switch from low/medium to high media resolution, sampled frames are tokenized in greater detail, requiring four times as many video tokens:

```python
VIDEO_URI = "https://www.youtube.com/watch?v=0pJn3g8dfwk"

display_tokens_for_video(
    VIDEO_URI,
    media_resolution_level=PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH,
)
```

```text
🧪 fps=None, media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 15,576 | 1,475 | 0 | 17,051 |

💡 Details

- Video: ceil(59 s × 1 frame/s × 264 tok/frame) = ceil(15576 tok) = 15576 tok
- Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

📄 Document Tokenization

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per PDF page:

| media_resolution | Tokens |
|----|---:|
| `MEDIA_RESOLUTION_LOW` | 280 |
| `MEDIA_RESOLUTION_MEDIUM` (default) | 560 |
| `MEDIA_RESOLUTION_HIGH` | 1,120 |

🐍 Here's the tokenization for a one-page PDF at different media resolutions:

```python
def display_tokens_for_document(
    document_uri: str,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {media_resolution_level=}")
    contents = Part.from_uri(
        file_uri=document_uri,
        mime_type="application/pdf",
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


DOCUMENT_URI = (
    "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/invoice.pdf"
)
media_resolution_levels = [
    PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW,
    PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM,
    PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH,
]

for media_resolution_level in media_resolution_levels:
    display_tokens_for_document(DOCUMENT_URI, media_resolution_level)
```

```text
🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 266 | 0 | 0 | 0 | 266 |

```text
🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM: 'MEDIA_RESOLUTION_MEDIUM'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 532 | 0 | 0 | 0 | 532 |

```text
🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 1,092 | 0 | 0 | 0 | 1,092 |

💡 Remarks

- Low: 266 tok/pg
- Medium: 532 tok/pg
- High: 1092 tok/pg

🐍 Here's another test for a 15-page PDF:

```python
DOCUMENT_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf"

for media_resolution_level in media_resolution_levels:
    display_tokens_for_document(DOCUMENT_URI, media_resolution_level)
```

```text
🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 3,990 | 0 | 0 | 0 | 3,990 |

```text
🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM: 'MEDIA_RESOLUTION_MEDIUM'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 7,800 | 0 | 0 | 0 | 7,800 |

```text
🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>
```

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 16,530 | 0 | 0 | 0 | 16,530 |

💡 Remarks

- Low: 3990 tok / 15 pg = 266 tok/pg
- Medium: 7800 tok / 15 pg = 520 tok/pg
- High: 16530 tok / 15 pg = 1102 tok/pg

🎉 Conclusion

You've now mastered token counting both locally and via the Gemini API! With the `LocalTokenizer`, you can estimate text token counts completely offline, saving bandwidth and avoiding rate limits. You've also seen how Gemini's multimodal tokenizers handle images, audio, video, and PDFs, and how to extract precise token usage from `usage_metadata` for accurate tracking and billing.

➕ More!

- Try it yourself: Use the companion notebook (or run the notebook on Colab) to reproduce all results in this article.
- Get inspired: Explore typical use cases in the Agent Platform Prompt Gallery.
- Stay updated: Follow the Agent Platform Release Notes.
- Follow me: Connect with me (@PicardParis) on LinkedIn or Twitter / X for more cloud, applied AI, and Python explorations…
View original source — Hacker Noon ↗


