|TurnitinBot General Information Page|
Overview
Chances are that you are reading this because you found a reference to this web page from your web server logs. This reference was left by Turnitin.com's web crawling robot, also known as TurnitinBot. This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to see if we can find similarities. For more information on this service, please visit www.turnitin.com.
|Below are questions grouped into categories to help answer any questions you may have.|
Frequently Asked Questions Grouped By Category
|General Information About Web Crawlers|
|What is a web crawler?|
|How does a web crawler work?|
|What is considered good crawling etiquette?|
|General Information About Turnitin.com|
|What services does Turnitin.com offer?|
|Why does Turnitin.com need to crawl my site?|
|How can I prevent TurnitinBot from accessing certain pages on my site?|
|How can I completely exclude TurnitinBot from my site?|
|What IP Address does TurnitinBot come from?|
|What is SlySearch?|
|Problems With TurnitinBot|
|Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?|
|I changed my robots.txt file to exclude TurnitinBot but it still continues to come back?|
|How can I contact you to report a problem?|
|Q: What is a web crawler?|
|A web crawler (aka spider, robot or bot) is a computer program that scours the web gathering content. Some crawlers are specific in what they are looking for, but ours is just interested in gathering as much content as possible.|
|Q: How does a web crawler work?|
|At its most basic level, a crawler follows a simple cycle of downloading a web page, finding the links in the web page, downloading the pages referenced by these links, and so on, in a loop. A more thorough explanation can be found at http://www.robotstxt.org/wc/robots.html.|
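The cycle described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not TurnitinBot's actual code. The names `LinkExtractor`, `crawl`, `fetch` and `PAGES` are assumptions made for the example, and the "web" is a small in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch` maps a URL to its HTML, or None if missing."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # download the page
        if html is None:
            continue  # nothing to parse (for example, a 404)
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)  # find the links in the page
        for href in extractor.links:
            absolute = urljoin(url, href)
            if absolute not in seen:  # queue only pages not seen before
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory instead of fetched over HTTP.
PAGES = {
    "http://www.somewhere.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://www.somewhere.com/a.html": '<a href="/">home</a>',
    "http://www.somewhere.com/b.html": '<a href="/missing.html">broken</a>',
}

order = crawl("http://www.somewhere.com/", PAGES.get)
print(order)
```

A real crawler would replace `PAGES.get` with an HTTP download and would also apply the etiquette rules described in the next answer.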
|Q: What is considered good crawling etiquette?|
|Good crawling etiquette relies on the crawler obeying a few rules. It should read and obey the directives in the robots.txt file for a site. It should also obey META exclusion tags within pages. To avoid overloading servers with requests, it should limit the rate at which it asks for content from a particular IP address.|
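The rate-limiting rule can be illustrated with a small per-server delay tracker. This is a minimal sketch under stated assumptions: the class name `ServerRateLimiter`, the two-second delay and the example addresses are all hypothetical, and nothing here describes the limits TurnitinBot itself applies. The clock and sleep functions are injectable so the example runs instantly.

```python
import time


class ServerRateLimiter:
    """Enforces a minimum delay between requests to the same server address."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # server address -> time of most recent request

    def wait(self, server):
        """Blocks until `server` may be contacted again; returns seconds waited."""
        now = self.clock()
        last = self.last_request.get(server)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self.sleep(delay)
        # Record the moment the request is actually allowed to go out.
        self.last_request[server] = now + delay
        return delay


# Demonstration with a fake clock, so no real sleeping takes place.
fake_now = [0.0]
limiter = ServerRateLimiter(2.0, clock=lambda: fake_now[0], sleep=lambda s: None)

first = limiter.wait("192.0.2.10")   # first request to this server
fake_now[0] = 0.5
second = limiter.wait("192.0.2.10")  # same server, only 0.5 s later
third = limiter.wait("192.0.2.20")   # different server, no history
print(first, second, third)
```

Keying the delay on the server address rather than on the site name means that several sites hosted on one machine share a single budget.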
|Q: What services does Turnitin.com offer?|
|Turnitin.com offers various services to the educational community. Most prominently, we provide a widely used and effective plagiarism detection service. We also provide a Peer Review service and a series of class management tools. To learn more about our service, visit www.turnitin.com.|
|Q: Why does Turnitin.com need to crawl my site?|
|Part of the plagiarism prevention service relies on comparing student papers to content found on the Internet. Since we do not know ahead of time which pages on the Internet a student will use, we need to gather them all for comparison. However, we do have automated ways of throwing away content and links that would be irrelevant to our service.|
|Q: How can I prevent TurnitinBot from accessing certain web pages on my site?|
|The Robots Exclusion Protocol gives web site maintainers a way to tell a crawler which parts of their site it may not access. Furthermore, it lets the administrator create access rules on a crawler-by-crawler basis.
It works something like this: TurnitinBot visits a web site http://www.somewhere.com. If it has not visited the site before, or has not visited it in a while, it first tries to download http://www.somewhere.com/robots.txt. It then examines the robots.txt file for any rules which apply to it. An example of a robots.txt file is:
#This is an example robots.txt file
User-agent: turnitinbot
Disallow: /secret/
Disallow: /hide/
In this example, the rules would only apply to the TurnitinBot crawler. Please note that both tokens, "user-agent" and "turnitinbot", are case insensitive. For example, TurnitinBot or TURNITINBOT are equally effective. The Disallow lines are used to exclude the crawler from particular pages on the site. In this case any page whose path starts with /secret/ or /hide/ will be excluded. For instance, http://www.somewhere.com/secret/world.html would be excluded but http://www.somewhere.com/secret.html wouldn't be.
For a more thorough explanation please visit http://www.robotstxt.org/wc/exclusion.html.
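The matching behaviour described above can be checked with Python's standard `urllib.robotparser` module. This is a sketch that uses Python's parser as a stand-in, and it is an assumption that it behaves like TurnitinBot's own parser in every detail. The robots.txt text and the agent name `SomeOtherBot` are written for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules of the kind described above, written out for this illustration.
ROBOTS_TXT = """\
#This is an example robots.txt file
User-agent: turnitinbot
Disallow: /secret/
Disallow: /hide/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The agent name is matched without regard to case.
blocked = parser.can_fetch("TurnitinBot", "http://www.somewhere.com/secret/world.html")
allowed = parser.can_fetch("TurnitinBot", "http://www.somewhere.com/secret.html")
# A crawler that is not named in the file is unaffected by these rules.
other = parser.can_fetch("SomeOtherBot", "http://www.somewhere.com/secret/world.html")

print(blocked, allowed, other)  # False True True
```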
|Q: How can I completely exclude TurnitinBot from my site?|
|To exclude TurnitinBot from all or portions of your site, all you have to do is create a file called robots.txt and put it in the topmost directory of your web site.
Below is an example of a robots.txt file which excludes ONLY our robot from a portion or all of your site.
#This is an example robots.txt file
User-agent: TurnitinBot
Disallow: /
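One way to check such a file before publishing it is to generate the text and test it locally. The sketch below assumes Python's standard `urllib.robotparser` as the checker. The helper `build_robots_txt` and the agent name `SomeOtherBot` are hypothetical names used only for this example.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(agent, disallowed_paths):
    """Returns robots.txt text whose rules apply only to `agent`."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# "Disallow: /" covers every path on the site.
whole_site = build_robots_txt("TurnitinBot", ["/"])

parser = RobotFileParser()
parser.parse(whole_site.splitlines())

turnitin_ok = parser.can_fetch("TurnitinBot", "http://www.somewhere.com/index.html")
others_ok = parser.can_fetch("SomeOtherBot", "http://www.somewhere.com/index.html")
print(turnitin_ok, others_ok)  # False True
```

Passing a list such as `["/secret/", "/hide/"]` instead of `["/"]` would exclude only those portions of the site.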
|Q: What IP Address does TurnitinBot come from?|
Turnitin use a number of different crawlers and content indexing systems, all of which share the Agent Name "TurnitinBot", and originate from one of a number of static IP addresses - which our system might randomly assign to the crawler/indexer at any given time. The two main use cases are listed below.
Content Partner organisations/Crossref Members:
Similarly, Content Partner organisations that provide their metadata (including full-text URLs) to Turnitin over FTP, so that Turnitin can crawl this content, should whitelist these IP addresses and the Agent Name "TurnitinBot" to facilitate the indexing of this content.
Please note, Crossref Members/Content Partners may also see activity from Turnitin's general webcrawler on their websites; if this causes any inconvenience or slowdown in traffic, this can be handled via robots.txt restrictions or by blocking just the latter IP range (18.104.22.168 to 22.214.171.124 and 126.96.36.199 to 188.8.131.52).
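A whitelist or block rule of this kind comes down to testing whether a visitor's address lies inside an inclusive range and whether its Agent Name matches. The sketch below uses Python's standard `ipaddress` module. The ranges shown are placeholders taken from the documentation-only block 192.0.2.0/24 and are NOT Turnitin's addresses; the names `CRAWLER_RANGES`, `in_range` and `is_listed_crawler` are likewise assumptions.

```python
import ipaddress

# Placeholder ranges (documentation-only addresses), NOT Turnitin's real ranges.
CRAWLER_RANGES = [
    ("192.0.2.16", "192.0.2.31"),
    ("192.0.2.64", "192.0.2.79"),
]


def in_range(address, first, last):
    """True if `address` lies between `first` and `last`, inclusive."""
    ip = ipaddress.ip_address(address)
    return ipaddress.ip_address(first) <= ip <= ipaddress.ip_address(last)


def is_listed_crawler(address, user_agent):
    """True if the Agent Name mentions TurnitinBot and the address is listed."""
    if "turnitinbot" not in user_agent.lower():
        return False
    return any(in_range(address, first, last) for first, last in CRAWLER_RANGES)


inside = is_listed_crawler("192.0.2.20", "TurnitinBot/1.5")
outside = is_listed_crawler("192.0.2.40", "TurnitinBot/1.5")
wrong_agent = is_listed_crawler("192.0.2.20", "SomeOtherBot/1.0")
print(inside, outside, wrong_agent)  # True False False
```

Checking both the address and the Agent Name guards against other clients that merely reuse the name.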
|Q: What is SlySearch ?|
|SlySearch is the old name of our robot. We decided to change its name to better reflect the service it represents.|
|Q: Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?|
|There are two explanations for this. Either your site or another site has an incorrect link to your site, i.e. one pointing to a page that doesn't exist. Not knowing better, we tried to follow this link, generating a 404 error on your server. The other possibility is that TurnitinBot improperly parsed a link from a page.|
|Q: I changed my robots.txt file to exclude TurnitinBot but it still continues to come back?|
|If we re-requested the robots.txt file before each page request, it would put a significantly larger load on servers and be wasteful of bandwidth. We get around this by caching robots.txt files. For versions Turnitinbot/1.4 and below, we cache the robots.txt file for 48 hours before we refresh our copy. As of version Turnitinbot/1.5, we dropped this value to 12 hours to better suit the needs of webmasters. As a result, a change to your robots.txt file may not take effect until our cached copy expires.|
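The caching behaviour can be pictured as a small time-to-live cache. This is a sketch of the general idea only, not TurnitinBot's implementation: the class `RobotsCache`, its `fetch` callback and the fake clock are assumptions, and only the 12-hour figure comes from the answer above.

```python
import time

TWELVE_HOURS = 12 * 60 * 60  # cache lifetime in seconds


class RobotsCache:
    """Keeps one robots.txt copy per host and refreshes it after `ttl_seconds`."""

    def __init__(self, fetch, ttl_seconds=TWELVE_HOURS, clock=time.time):
        self.fetch = fetch      # callable: host -> robots.txt text
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}       # host -> (time fetched, robots.txt text)

    def get(self, host):
        now = self.clock()
        entry = self.entries.get(host)
        if entry is None or now - entry[0] >= self.ttl:
            entry = (now, self.fetch(host))  # refresh the stale or missing copy
            self.entries[host] = entry
        return entry[1]


# Demonstration with a fake clock and a fetch function that counts its calls.
fake_now = [0.0]
downloads = []


def fake_fetch(host):
    downloads.append(host)
    return f"robots.txt version {len(downloads)}"


cache = RobotsCache(fake_fetch, clock=lambda: fake_now[0])
a = cache.get("www.somewhere.com")  # first visit: downloaded
fake_now[0] = 6 * 60 * 60
b = cache.get("www.somewhere.com")  # 6 hours later: served from the cache
fake_now[0] = 13 * 60 * 60
c = cache.get("www.somewhere.com")  # 13 hours later: downloaded again
print(a, b, c, len(downloads))
```

The second lookup returns the old copy even if the file on the server has changed in the meantime, which is why an updated robots.txt is not seen immediately.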
|Q: How can I contact you to report a problem?|
|If you still have questions or want to speak with us about our crawler's behavior you can contact us at firstname.lastname@example.org. If you could please provide us with the following information it would help us determine what occurred when our crawler visited your site.
* A description of your question or problem.