OpenAI destroying my usage

I noticed that OpenAI has been hitting my website, and hitting it hard. It requests pages with query params like ?page=1513, which obviously don't exist. I know I can deny AI bots in Vercel, which I have done in the meantime to stop GPTBot from hammering the site. However, I was wondering if there's a better way to at least let OpenAI and other AI bots know which pages are actually available, so they aren't repeatedly scraping my website with incorrect params and paths.

Here is the user agent, according to Vercel:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
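
For reference, OpenAI documents that GPTBot respects robots.txt, so that's another lever alongside the Vercel setting. A minimal sketch, either blocking it entirely or just walling off the bogus pagination params (the ?page= pattern is my case; wildcard support varies by crawler):

User-agent: GPTBot
Disallow: /

# Or, to allow crawling but block the nonexistent pagination URLs:
# User-agent: GPTBot
# Disallow: /*?page=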

llms.txt is what you'd want to use to help LLMs understand your site's structure and content.

Do AI bots actually adhere to llms.txt? Reading through What Is llms.txt, and Should You Care About It?, it looks like none of the major ones support it. Furthermore, what would an llms.txt file look like? I couldn't find an example in the docs beyond the fact that you can define the route.

I am just curious what it would look like to tell the LLM which pages are available, etc.

This post has a nice explanation: LLMs.txt Explained | TDS Archive

There are a few tools out there now that convert sitemap.xml to llms.txt if you don't want to build your own solution. I haven't tried it myself, but llmstxt looks like a convenient option.
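
If you'd rather roll your own, the conversion is small enough to script. A rough Python sketch, assuming a standard sitemap.xml at a placeholder URL and using each URL's path as the link title (a real tool would fetch each page's actual title):

import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder; swap in your site
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_to_llms_txt(sitemap_url: str) -> str:
    # Pull every <loc> entry out of a standard sitemap.
    tree = ET.parse(urlopen(sitemap_url))
    urls = [loc.text.strip() for loc in tree.findall(".//sm:loc", NS)]
    # llms.txt is markdown: an H1 title, a blockquote summary,
    # then sections of "- [title](url)" links.
    lines = [
        "# Example Site",  # placeholder title
        "",
        "> Placeholder one-line summary of the site.",
        "",
        "## Pages",
        "",
    ]
    for url in urls:
        # Use the URL path as the link title for this sketch.
        path = url.split("/", 3)[-1] if url.count("/") > 2 else url
        lines.append(f"- [{path or 'Home'}]({url})")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(sitemap_to_llms_txt(SITEMAP_URL))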

Thanks! I will take a look. I don't have paginated pages in my sitemap.xml because I couldn't find any solid advice on whether they should be included, and I figured that since I have pagination links it wouldn't be needed. It would be nice if there were a guide on creating an llms.txt, or at least an example.

Examples are definitely helpful! You can find the format, examples, and more info at llmstxt.org.
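
For a concrete feel, here's a minimal llms.txt following the llmstxt.org format (the site name, URLs, and descriptions are all placeholders):

# Example Site

> A one-sentence summary of what the site is about.

## Docs

- [Getting started](https://example.com/docs/getting-started): How to set up the project
- [API reference](https://example.com/docs/api): Endpoints and parameters

## Blog

- [All posts](https://example.com/blog): Index of articles

## Optional

- [Changelog](https://example.com/changelog): Release history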