I noticed that OpenAI has been hitting my website, and hitting it hard. It will hit pages with query params like ?page=1513, which obviously don't exist. I know I can deny AI bots in Vercel, which I have done in the meantime to stop OpenAI from hammering the site. However, I was wondering if there's a better way to at least let OpenAI and other AI bots know which pages are actually available, so they aren't just repeatedly scraping my website with incorrect params and paths.
Here is the user agent according to Vercel:
`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)`
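For what it's worth, OpenAI documents that GPTBot respects robots.txt, so one option is to disallow just the paginated query-param URLs rather than blocking the bot outright. A minimal sketch, assuming the crawler honors wildcard patterns (the `*` in path rules is a common extension, not part of the original robots.txt standard):

```txt
# Keep GPTBot off paginated query-param URLs but allow everything else
User-agent: GPTBot
Disallow: /*?page=

# Or, to block it entirely:
# User-agent: GPTBot
# Disallow: /
```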
Do AI bots adhere to llms.txt? Reading through "What Is llms.txt, and Should You Care About It?", it looks like none of the major ones support it. Furthermore, what would an llms.txt actually look like? I couldn't find an example in the docs other than that you can define the route.
I am just curious what it would look like to tell the LLM which pages are available, etc.
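For illustration, here's what a minimal llms.txt could look like following the format proposed at llmstxt.org: an H1 title, a blockquote summary, then H2 sections of markdown link lists (all URLs and descriptions here are hypothetical):

```md
# Example Site

> A blog about web development: articles, tags, and an about page.

## Articles
- [Building with Next.js](https://example.com/articles/building-with-nextjs): Notes on app-router patterns
- [Deploying to Vercel](https://example.com/articles/deploying-to-vercel): A deployment walkthrough

## Pages
- [About](https://example.com/about): Who writes this site

## Optional
- [Tag index](https://example.com/tags): Full list of article tags
```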
There are a few tools out there now to convert a sitemap.xml to llms.txt if you don't want to build your own solution. I haven't tried it myself, but llmstxt looks like a convenient option.
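If you do end up rolling your own, the conversion is only a few lines. Here's an untested TypeScript sketch (the sitemap URL, site title, and output path are all placeholders) that pulls the `<loc>` entries out of a standard sitemap and writes them into the llms.txt link-list format:

```ts
import { writeFile } from "node:fs/promises";

// Placeholder sitemap location; swap in your own.
const SITEMAP_URL = "https://example.com/sitemap.xml";

async function sitemapToLlmsTxt(): Promise<void> {
  const xml = await (await fetch(SITEMAP_URL)).text();

  // Pull every <loc> entry out of the sitemap. A real converter might
  // use an XML parser, but a regex is enough for a standard sitemap.
  const urls = [...xml.matchAll(/<loc>\s*(.*?)\s*<\/loc>/g)].map((m) => m[1]);

  const lines = [
    "# Example Site", // the H1 title is the only required element
    "",
    "> Pages on this site, generated from sitemap.xml.",
    "",
    "## Pages",
    // One markdown link per URL, using the pathname as the link text.
    ...urls.map((url) => `- [${new URL(url).pathname}](${url})`),
    "",
  ];

  await writeFile("public/llms.txt", lines.join("\n"));
  console.log(`Wrote ${urls.length} URLs to public/llms.txt`);
}

sitemapToLlmsTxt().catch(console.error);
```

Dropping the output in public/ means Next.js/Vercel will serve it at /llms.txt with no extra routing.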
Thanks! I will take a look. I don't have paginated pages in my sitemap.xml because I couldn't find any solid advice on whether they should be included, and I figured that since I have pagination links it wouldn't be needed. It would be nice if there were a guide or an example for creating an llms.txt.