jolomic posted on 2024-2-20 14:00:43

Control crawl behavior on domain

The downside is that it is somewhat limited in terms of customizability.

Where to put the robots.txt file

Place the robots.txt file in the root directory of the subdomain to which it applies. For example, to control crawl behavior on domain.com, you must have access to the robots.txt file at domain.com/robots.txt. If you want to control crawling on a subdomain, such as blog.domain.com, you must have access to the robots.txt file at blog.domain.com/robots.txt.

Best practices for robots.txt files

Keep these points in mind to avoid common mistakes.

Use a new line for each directive

Each directive must be placed on its own line. Otherwise, search engines may misparse your file.

Bad example:

User-agent: * Disallow: /directory/ Disallow: /another-directory/

Good example:

User-agent: *
Disallow: /directory/
Disallow: /another-directory/
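As a quick way to sanity-check rules like these, Python's standard-library urllib.robotparser can parse a robots.txt and answer whether a given URL is crawlable. This is a minimal sketch; the domain.com URLs are placeholders, and note that, as far as I know, urllib.robotparser implements the original prefix-matching rules, so it will not reliably handle the * and $ wildcards discussed further below.

from urllib.robotparser import RobotFileParser

# The "good example" rules above, one directive per line
robots_txt = """\
User-agent: *
Disallow: /directory/
Disallow: /another-directory/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Blocked: the path falls under /directory/
print(rp.can_fetch("*", "https://domain.com/directory/page"))  # False
# Crawlable: no rule matches this path
print(rp.can_fetch("*", "https://domain.com/allowed-page"))    # True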


Use wildcards to simplify instructions

The wildcard (*) not only allows a directive to apply to all user agents, but also lets you match URL patterns when declaring directives. For example, if you want to prevent search engines from accessing parameterized product category URLs on your site, you could list them like this:

User-agent: *
Disallow: /products/t-shirts?
Disallow: /products/hoodies?
Disallow: /products/jackets?
…

But that is not very efficient. I recommend using wildcards to simplify things:

User-agent: *
Disallow: /products/*?

This example blocks search engines from crawling all URLs under the /products/ subfolder that contain a question mark. In other words, it blocks parameterized product category URLs.
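To see concretely how the * wildcard matches, here is a rough sketch of the matching logic in Python, translating a robots.txt path rule into a regular expression. This is an illustration only, not a full robots.txt parser, and rule_matches is a hypothetical helper name.

import re

def rule_matches(rule: str, path: str) -> bool:
    # Sketch of robots.txt pattern matching: "*" matches any
    # character sequence and a trailing "$" anchors the URL end.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

# The single wildcard rule covers every parameterized product URL:
print(rule_matches("/products/*?", "/products/t-shirts?color=red"))  # True: blocked
print(rule_matches("/products/*?", "/products/hoodies?size=m"))      # True: blocked
print(rule_matches("/products/*?", "/products/t-shirts"))            # False: crawlable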




Use "$" to specify the end of the URL

Include a "$" symbol to indicate the end of a URL. For example, if you want to prevent search engines from accessing all .pdf files on your site, your robots.txt file might look like this:

User-agent: *
Disallow: /*.pdf$

In this example, search engines cannot access URLs that end in .pdf. In other words, /file.pdf is blocked, but /file.pdf?id=68937586 is still accessible because it does not end with ".pdf".
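Reusing the rule_matches sketch from the wildcard section above, the effect of the "$" anchor can be checked the same way:

print(rule_matches("/*.pdf$", "/file.pdf"))              # True: blocked, ends in .pdf
print(rule_matches("/*.pdf$", "/file.pdf?id=68937586"))  # False: crawlable, does not end in .pdf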


