
Stop tracking hypothetical prompts: how to build a prompt set from real buyer queries

Most AI visibility programs start with a blank spreadsheet. A group of people gets together in a room, speculates about what buyers might ask ChatGPT, Perplexity, Gemini, or Copilot, tracks those prompts, and then reports the movement as though the prompt set were the market.

Prompt tracking is not the problem. Tracking prompts is the only practical way to find out whether AI assistants mention or cite your brand. The issue is that the input is frequently fiction. If the prompts are fiction, the visibility report is fiction, and so is every decision based on it.

Most teams already have the better input: the query data in Google Search Console, Google Ads, and Bing Webmaster Tools. Using actual (anonymized) numbers from a B2B identity vendor's monitoring program, this article demonstrates a six-step process for turning that data into a prompt set worth tracking.

Why guessed prompts produce fiction twice

Guessed prompts fail for two layered reasons, and it is worth being specific about both because each one distorts the report differently.

Sampling noise is the first. According to SparkToro's research on AI brand suggestions, AI tools had less than a 1 in 100 chance of returning the same brand list twice for the same prompt, and identical ordering appeared less than 1 in 1,000 times. A team that reports "we dropped out of ChatGPT this week" after running a single prompt once a week is mostly reporting dice rolls. The run-to-run variance is larger than the signal they believe they are measuring.

Input error is the second, and it compounds the first. When the prompt was invented in a conference room, you are monitoring a noisy outcome on a question buyers might never ask.
Brainstormed prompt lists tend to cluster around the obvious "best [category] tools" language, because that is how marketers view their category. That is not how buyers speak. They ask about team size, compliance needs, stack specifics, constraints, and occasionally paste complete error messages.

Meanwhile, longer, task-shaped questions have become more common in the engines themselves. Google describes AI Overviews as built for more complex questions that need multi-step reasoning to answer, and describes AI Mode as using a query fan-out technique that splits a single question into subtopics and runs several searches simultaneously. That shape (contextual, constrained, and decision-oriented) must be reflected in your prompt set. Conference-room guesswork rarely reflects it.

Step 1: Pull the queries buyers already gave you

Three first-party sources cover the workflow, and each one does a different job. Use all three if you have them; the workflow degrades gracefully if you only have Search Console.

Google Search Console is the demand layer. The Performance report shows the queries, pages, clicks, impressions, CTR, and average position your website currently earns in Google. The Search Console API returns the same data programmatically, with query, page, country, and device dimensions for recurring exports, and the bulk export to BigQuery creates a long-term query warehouse that you can refresh monthly. One caveat keeps the method honest: the bulk export does not include anonymized queries, so first-party logs are the best available evidence of buyer language, not the whole of it.

Google Ads is the commercial layer. The search terms report shows the actual searches that triggered your ads, along with impressions, clicks, and conversions.
The conversion column indicates which phrasings belong to buyers rather than browsers, and paid queries tend to skew toward bottom-funnel modifiers (price, alternative, integration, demo, vendor). Skip this layer if you don't run paid search; the other two work fine without it.

Bing Webmaster Tools is the AI-citation layer, and most teams ignore it. Its AI Performance report shows grounding queries, cited pages, total citations, and the mapping between queries and the pages that AI answers draw from across Copilot and partner experiences. Because grounding queries show what the engine actually searched for when assembling an answer, they are the closest thing to first-party AI retrieval data currently available.

Expect the raw export to be messy; the mess is the point.

Caption: A real 3-month Search Console export for a B2B identity vendor: 40.3k clicks, 10.6m impressions. Note the top queries by clicks: developer utility lookups, hash comparisons, and an entire error message pasted as a query. None of these are the prompts to track, which is exactly why filtering comes before converting.

Alt text: Google Search Console performance report showing noisy top queries unrelated to buyer intent.

Take a moment to look at that screenshot. The most-clicked queries are utility lookups and an error message pasted verbatim. That is fine. It is the first lesson of the workflow: raw query logs are not buyer prompts. They are evidence of who shows up and why. The next two steps separate the signal.

Step 2: Clean and classify before you convert

Do a cleaning pass on the export before doing anything creative. Lowercase everything, deduplicate, group close variants ("sso providers," "sso provider," "providers sso"), and preserve the metadata (source page, clicks, impressions, position, conversions, country, device). You'll need that metadata for scoring in step 4.
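
The cleaning pass above can be sketched in a few lines of Python. This is a minimal sketch, not the vendor's pipeline: the row keys (`query`, `page`, `clicks`, `impressions`) are assumed names for an export, the numbers in the usage example are made up, and the variant grouping (sorted tokens with a naive plural strip) is a stand-in heuristic you would likely tune.

```python
from collections import defaultdict


def variant_key(query: str) -> tuple:
    """Grouping key for close variants: order-insensitive, naively singularized."""
    tokens = []
    for tok in query.lower().split():
        # Heuristic (assumption): drop one trailing "s" from longer tokens.
        if len(tok) > 4 and tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]
        tokens.append(tok)
    return tuple(sorted(tokens))


def clean_queries(rows: list) -> list:
    """Lowercase, deduplicate, and group close variants while summing metrics."""
    groups = defaultdict(
        lambda: {"variants": defaultdict(int), "clicks": 0, "impressions": 0, "pages": set()}
    )
    for row in rows:
        query = " ".join(row["query"].lower().split())  # lowercase, collapse spaces
        if not query:
            continue
        group = groups[variant_key(query)]
        group["variants"][query] += row["clicks"]
        group["clicks"] += row["clicks"]
        group["impressions"] += row["impressions"]
        group["pages"].add(row["page"])  # keep source pages for later scoring

    cleaned = []
    for group in groups.values():
        # Label the group with its most-clicked phrasing.
        label = max(group["variants"], key=group["variants"].get)
        cleaned.append({
            "query": label,
            "variants": sorted(group["variants"]),
            "clicks": group["clicks"],
            "impressions": group["impressions"],
            "pages": sorted(group["pages"]),
        })
    return sorted(cleaned, key=lambda r: r["clicks"], reverse=True)


# Usage with made-up rows (not the vendor's data).
rows = [
    {"query": "SSO Providers", "page": "/sso", "clicks": 10, "impressions": 100},
    {"query": "sso provider", "page": "/sso", "clicks": 5, "impressions": 50},
    {"query": "providers  sso", "page": "/blog/sso", "clicks": 1, "impressions": 20},
    {"query": "soc 2 checklist", "page": "/soc2", "clicks": 3, "impressions": 40},
]
for group in clean_queries(rows):
    print(group["query"], group["clicks"], group["variants"])
```

The three SSO phrasings collapse into one group whose clicks and impressions are summed, while the original variants and source pages stay attached. Other metadata columns (position, conversions, country, device) could be carried through the same way.
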
One shortcut lives inside Search Console itself: before you even open a spreadsheet, use the query filter with a regex like ^(how|what|which|best|vs|versus|why|can) to surface question-shaped and comparison-shaped queries directly. It won't capture everything, but it produces a high-signal shortlist in roughly thirty seconds.

Next, categorize every remaining query according to the usual search intent taxonomy: informational, navigational, commercial, and transactional. Three sorting rules do most of the work:

- Navigational queries for your own brand are set aside, because they demonstrate brand demand but don't make useful tracking prompts.
- Error messages and help requests are tagged rather than removed, because they show what your customers are trying to fix, and "how do I solve X" is one of the most common shapes of an actual AI prompt.
- Everything else passes a simple keep-filter: does the query suggest a purchase decision, a vendor assessment, an implementation task, or a business problem? The telltale words include tools, pricing, alternative, integration, checklist, and how-to.

Step 3: Convert queries into prompts without losing intent

Teams go wrong at this stage in one of two ways. They either track the raw keyword as if it were a prompt, ignoring how people actually talk to assistants, or they "expand" the phrase so much that the original intent vanishes and they are back to guessing with extra steps.

The effective approach is to treat the query as evidence of a job-to-be-done, then reconstruct it as the complete question a buyer with that job would ask an assistant. Keep the intent intact. Add the buyer's role, the company context, and the constraints. State the decision they need help with.
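
One way to keep a converted prompt honest is to store it next to the query it came from. The sketch below is an illustration under stated assumptions, not a method from any of the sources cited here: the `TrackedPrompt` fields and the token-overlap check are hypothetical, the prompts are still written by a person, and the role and constraint values are made up.

```python
import re
from dataclasses import dataclass


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, e.g. 'SOC 2' -> {'soc', '2'}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass(frozen=True)
class TrackedPrompt:
    source_query: str     # observed query, kept verbatim for traceability
    buyer_role: str       # who is asking
    company_context: str  # size, segment, stack
    constraints: str      # limits the answer must respect
    prompt: str           # the full question, written by a person

    def source_overlap(self) -> float:
        """Share of source-query tokens that survive in the prompt (0.0 to 1.0)."""
        source = _tokens(self.source_query)
        if not source:
            return 0.0
        return len(source & _tokens(self.prompt)) / len(source)


anchored = TrackedPrompt(
    source_query="soc 2 checklist",
    buyer_role="security lead",
    company_context="50-person SaaS company",
    constraints="first audit",
    prompt="What should a 50-person SaaS company prioritize in a SOC 2 readiness checklist?",
)
drifted = TrackedPrompt(
    source_query="soc 2 checklist",
    buyer_role="security lead",
    company_context="50-person SaaS company",
    constraints="first audit",
    prompt="What is the future of compliance automation?",
)
print(anchored.source_overlap())  # every source token is still present
print(drifted.source_overlap())   # no source token survived the rewrite
```

Token overlap is a crude proxy for intent: it will miss plurals and synonyms, so a low score is a reason for a human to look again rather than an automatic rejection.
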
Both iPullRank's prompt recipes and Tinuiti's prompt methodology advocate for full-sentence, persona-aware prompts that vary across journey stages, rather than a single generic phrase per topic. When your buying committee includes different personas, a single source query can legitimately seed two or three prompts. "SAML SSO providers" is a different question coming from a developer who has to build the integration than from a marketing-side buyer comparing vendors on price. Create the variant when the persona actually changes the answer you would want to be mentioned in; when it does not, you are bloating the set.

Here is the transformation pattern (example queries are illustrative):

| Real query pattern | Intent | Conversational prompt to track | What the rewrite adds |
| --- | --- | --- | --- |
| api security tools | Commercial evaluation | What are the best API security tools for a mid-market SaaS team that needs runtime protection and compliance reporting? | Buyer type, use case, evaluation criteria |
| how to monitor ai search visibility | Instructional | How should a B2B SaaS marketing team monitor whether AI assistants mention or cite its content? | Turns a how-to into a task someone would hand an assistant |
| saml sso providers | Commercial evaluation | Best SAML SSO providers for B2B SaaS | Category plus segment, kept close to source |
| [vendor] alternative | Decision stage | Cheaper alternatives to [Vendor A] / [Vendor B] for SSO and user management | Converts a comparison keyword into a buying decision |
| soc 2 checklist | Implementation | What should a 50-person SaaS company prioritize in a SOC 2 readiness checklist? | Company size and prioritization, same job-to-be-done |

The caution that belongs with every iteration of this table: stay near the source. If the query is "SOC 2 checklist," then "What is the future of compliance automation?" is not the prompt.
The engine's own fan-out will expand retrieval on its side. Your prompt set should stay anchored in observed buyer language, because that anchoring is why this approach outperforms brainstorming.

Caption: The converted prompt set in tracking. Each prompt traces back to query patterns from the step 1 exports, and each carries its own visibility score, presence rate, and sentiment rather than disappearing into a single blended number.

Alt text: AI visibility dashboard listing conversational prompts with per-prompt scores, presence rates, and sentiment.

Step 4: Build 20 to 40 prompts across the messy middle, not 500

Avoid the enormous library of synthetic prompts. SE Ranking's guidelines recommend starting with 20 to 40 prompts, running them across two to three AI models, and tracking them for at least 30 days before drawing any conclusions. A small evidence-based set beats a big guessed set because each prompt addresses a real question. A set small enough to review weekly also produces decisions.

Weight decision-stage shapes more heavily when deciding which query patterns earn one of those few slots. Research on which prompts actually generate revenue indicates that AI visibility connects to pipeline, rather than to a vanity dashboard, in bottom-of-funnel and late-middle-funnel prompts, where a strong answer has to name vendors.

Balance is just as important as size. Google's messy middle research depicts purchasing as looping between exploration and evaluation rather than marching down a funnel, so a prompt set composed solely of "best X" comparison prompts misses most of your buyers' actual journey. A starting allocation that has proven effective in practice is 5 to 8 awareness prompts, 8 to 12 evaluation prompts, 5 to 8 instructional prompts, 3 to 6 brand-specific prompts, and 3 to 6 transactional prompts. Treat it as a starting mix, then let your own query distribution and sales motion reshape it.
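
The starting allocation above is easy to check mechanically: its minimums sum to 24 and its maximums to 40, so a set that follows it also fits the 20 to 40 cap. The sketch below is a hedged illustration in Python; the stage labels and the `check_mix` helper are assumed names, and it returns notes rather than errors because the mix is meant to be reshaped.

```python
from collections import Counter

# Starting ranges from the allocation above; the stage keys are assumed labels.
STARTING_MIX = {
    "awareness": (5, 8),
    "evaluation": (8, 12),
    "instructional": (5, 8),
    "brand": (3, 6),
    "transactional": (3, 6),
}
SET_SIZE = (20, 40)


def check_mix(stages: list) -> list:
    """Return human-readable notes where a set departs from the starting mix."""
    notes = []
    total = len(stages)
    if not SET_SIZE[0] <= total <= SET_SIZE[1]:
        notes.append(f"set size {total} is outside {SET_SIZE[0]}-{SET_SIZE[1]}")
    counts = Counter(stages)
    for stage, (low, high) in STARTING_MIX.items():
        n = counts[stage]  # Counter returns 0 for stages with no prompts
        if n < low:
            notes.append(f"{stage}: {n} is below {low}-{high}")
        elif n > high:
            notes.append(f"{stage}: {n} is above {low}-{high}")
    for stage in sorted(set(counts) - set(STARTING_MIX)):
        notes.append(f"{stage}: unknown stage label")
    return notes


# One stage label per prompt in the set (made-up sets for illustration).
balanced = (["awareness"] * 6 + ["evaluation"] * 10 + ["instructional"] * 6
            + ["brand"] * 4 + ["transactional"] * 4)
only_best_x = ["evaluation"] * 25

print(check_mix(balanced))  # no notes for a set inside every range
for note in check_mix(only_best_x):
    print(note)
```

A set made only of comparison prompts passes the size cap but is flagged on every stage, which is the imbalance the messy middle argument warns about.
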

To choose the 20 to 40, score each candidate query on five criteria:

- First-party evidence (0 to 3 points, more if it appears in multiple sources).
- Buyer intent (0 to 3).
- AI prompt fit (0 to 2, meaning the query naturally becomes a multi-clause question).
- Content ownership (0 to 1, you have or can credibly build a page that answers it).
- Strategic coverage (0 to 1, it fills a persona or journey-stage gap).

Anything scoring seven or more, out of a possible ten, is accepted. The scoring sounds bureaucratic, but it takes about an hour and keeps everyone's pet prompts out of the set.

Step 5: Track visibility as a probability, not a rank

SparkToro's inconsistency numbers should permanently reset expectations about what AI visibility tracking can do. The answer is not to stop tracking. The answer is to track distributions: run each prompt multiple times, across engines, and report rates rather than positions.

| Metric | Definition | Why it matters |
| --- | --- | --- |
| Visibility percentage | Brand-mentioned responses ÷ total runs | Stops overreaction to one noisy answer |
| Citation rate | Responses citing your domain ÷ total runs | Measures whether your content is the source |
| Competitor co-mention rate | Responses naming target competitors ÷ total runs | Shows whether you are in the consideration set |
| Answer gap | Prompts answered well with no content of yours to support them | Turns tracking into a content roadmap |
| Grounding-query alignment | Overlap between your prompts and Bing grounding queries | Connects the set to observed AI retrieval |

Three findings from the identity vendor's first month show why per-prompt, per-engine reporting beats a blended score.

First, the engine spread on the same prompt set was huge: per-engine visibility scores of 41.5 versus 6.1, and a presence rate of 58% on one assistant compared to 2% on another.
A single blended score would have averaged that into a meaningless middle, hiding both the emergency that needed fixing and the win that was worth emphasizing.

Caption: Same prompts, six engines, wildly different outcomes. The blended visibility score (11.6) conceals a 58% presence rate on one engine and 2% on another.

Alt text: Per-engine breakdown table showing large differences in visibility score and presence rate across AI engines.

Second, the lost-prompt report became the content roadmap. Out of 940 recorded responses, prompts such as "Best SCIM provisioning platforms" ran 16 times with no brand presence, while competitors averaged 3.9 mentions per response. That is not a vanity metric. It is a concrete list of conversations buyers are having with assistants, in their own words, in which competitors are recommended and you are not mentioned. Every row corresponds to a single content decision. Teams that run this loop at scale typically end up templating it, and this comparison of traditional versus AI-powered programmatic SEO is a helpful way to see when query-led content systems make sense and when they collapse into thin pages.

Caption: Lost prompts: queries where the brand was absent across all runs while at least one competitor was mentioned, with average competitor mentions per response. Because every prompt is query-sourced, every gap is a validated content opportunity.

Alt text: Dashboard tables of weakest, strongest, and lost prompts ranked by visibility score and competitor mentions.

Third, citation analysis reframed the off-site strategy. Just 4% of the 11,835 citations recorded during the window pointed to brand-owned pages and 16% pointed to competitor-owned pages; the remaining 80% went to neutral third-party domains, such as industry journals and community threads. Engines mostly cite the commons, not the vendors.
For a prompt set built from actual queries, that distribution shows where to earn presence next: the neutral sources the engines already rely on for your topics.

Caption: Citation mix across 11,835 recorded citations: 4% brand-owned, 16% competitor-owned, 80% neutral third-party domains. The engines' favorite sources are mostly nobody's website.

Alt text: Donut chart and domain table showing brand, competitor, and neutral citation share in AI answers.

Step 6: Refresh monthly, prune quarterly

The hidden advantage of a prompt set built from query data is its built-in maintenance cycle, whereas a brainstormed list goes stale the day it ships.

Every month, review the new Search Console queries, Ads search terms, and Bing grounding queries that have appeared since the previous cycle. Score the new patterns and promote anything that scores seven or above.

Prune quarterly: merge near-duplicates, retire prompts that no longer match actual query patterns, and rebalance the journey-stage mix against your current sales motion. The 30-day minimum baseline applies again after each major change to the set, so avoid judging a revised set in its first week.

As the program matures, watch for three failure modes:

- Set bloat: each stakeholder adds "just one more prompt" until the set is no longer reviewable. The scoring threshold is your defense.
- Trend chasing: rewriting prompts around this month's news terminology rather than observed queries, which reintroduces the guessing problem through the back door.
- Metric drift: when someone asks for "our ChatGPT ranking," the truthful answer is still that a probabilistic medium has no ranks. Visibility percentage across a rolling window is the figure that stays meaningful.

The takeaway

Hypothetical prompt sets produce hypothetical visibility reports.
The fix takes about an afternoon: export the queries buyers have already typed, clean and classify them by intent, rebuild the keepers as the complete questions a buyer would ask an assistant, cap the set at 20 to 40 with a scoring threshold, report rates rather than ranks, and schedule refreshes from new query data. Every figure in the resulting report can then be traced back to demonstrated demand, which is the difference between tracking the market and tracking your own imagination.
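
As a closing sketch, the two pieces of arithmetic in that fix (the step 4 scoring threshold and the step 5 rates) might look like this in Python. The criterion keys, the `Run` record, and the engine names are assumed labels, and the runs in the usage example are made up; the point caps, the threshold of seven, and the two rate definitions come from the steps above.

```python
from dataclasses import dataclass

# Maximum points per criterion from step 4; the key names are assumed labels.
MAX_POINTS = {
    "first_party_evidence": 3,
    "buyer_intent": 3,
    "ai_prompt_fit": 2,
    "content_ownership": 1,
    "strategic_coverage": 1,
}
ACCEPT_AT = 7  # threshold from step 4, out of a possible 10


def score_candidate(points: dict) -> tuple:
    """Sum the five criteria and report whether the query earns a slot."""
    total = 0
    for name, cap in MAX_POINTS.items():
        value = points.get(name, 0)
        if not 0 <= value <= cap:
            raise ValueError(f"{name} must be between 0 and {cap}, got {value}")
        total += value
    return total, total >= ACCEPT_AT


@dataclass(frozen=True)
class Run:
    engine: str
    brand_mentioned: bool
    domain_cited: bool


def rates(runs: list) -> dict:
    """Visibility percentage and citation rate over a batch of runs."""
    n = len(runs)
    if n == 0:
        return {"runs": 0, "visibility_pct": None, "citation_rate_pct": None}
    return {
        "runs": n,
        "visibility_pct": 100 * sum(r.brand_mentioned for r in runs) / n,
        "citation_rate_pct": 100 * sum(r.domain_cited for r in runs) / n,
    }


def rates_by_engine(runs: list) -> dict:
    """Same rates, split per engine so a blended number cannot hide the spread."""
    engines = sorted({r.engine for r in runs})
    return {e: rates([r for r in runs if r.engine == e]) for e in engines}


# Made-up runs for one prompt on two hypothetical engines.
runs = (
    [Run("engine_a", True, True)] * 2
    + [Run("engine_a", True, False)]
    + [Run("engine_a", False, False)]
    + [Run("engine_b", False, False)] * 4
)
print(score_candidate({"first_party_evidence": 3, "buyer_intent": 3, "ai_prompt_fit": 1}))
print(rates(runs))
print(rates_by_engine(runs))
```

In this made-up batch the blended visibility sits between a strong engine and an absent one, which is the same masking effect the per-engine findings describe.
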
View original source — Hacker Noon ↗



