TurnitinBot General Information Page

Overview

Chances are that you are reading this because you found a reference to this web page in your web server logs. This reference was left by Turnitin.com's web crawling robot, also known as TurnitinBot. This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to see if we can find similarities. For more information on this service, please visit www.turnitin.com.
The questions below are grouped into categories to help you find the answers you need.

Frequently Asked Questions Grouped By Category

    General Information About Web Crawlers
What is a web crawler?
How does a web crawler work?
What is considered good crawling etiquette?
    General Information About Turnitin.com
What services does Turnitin.com offer?
Why does Turnitin.com need to crawl my site?
    Controlling TurnitinBot
How can I prevent TurnitinBot from accessing certain pages on my site?
How can I completely exclude TurnitinBot from my site?
What IP addresses does TurnitinBot come from?
    Problems With TurnitinBot
Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?
How can I contact you to report a problem?
Q: What is a web crawler?
A web crawler (also known as a spider, robot, or bot) is a computer program that scours the web gathering content. Some crawlers are specific in what they are looking for, but ours is simply interested in gathering as much content as possible.
Q: How does a web crawler work?
At its most basic level, a crawler follows a simple cycle of downloading a web page, finding the links in the web page, downloading the pages referenced by these links, and so on, in a loop. A more thorough explanation can be found at https://www.robotstxt.org/robotstxt.html.
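To make the cycle concrete, here is a minimal sketch of such a loop in Python. This is an illustration only, not TurnitinBot's actual code; the crawl function, the LinkExtractor class, and the page limit are made up for this example, and the politeness rules discussed in the next answer are omitted for brevity.

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Collects the href targets of all <a> tags on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, queue, fetched = {seed}, deque([seed]), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                page = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        fetched += 1
        extractor = LinkExtractor()
        extractor.feed(page)  # find the links in the downloaded page
        for link in extractor.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)      # remember it so we visit it only once
                queue.append(absolute)  # schedule it for download
    return seen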
Q: What is considered good crawling etiquette?
Good crawling etiquette relies on the crawler obeying a few rules. It should read and obey the directives in a site's robots.txt file. It should also obey META exclusion tags within pages. And to avoid overloading servers with requests, it should limit the rate at which it asks for content from a particular IP address.
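As an illustration, Python's standard urllib.robotparser module can handle the robots.txt rule, and a simple pause covers rate limiting. This is only a sketch; the site, the "PoliteBot" user-agent, and the 10-second delay are invented values.

import time
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.somewhere.com/robots.txt")  # hypothetical site
rp.read()  # download and parse the site's robots.txt

url = "http://www.somewhere.com/page.html"
if rp.can_fetch("PoliteBot", url):  # obey the site's directives for this agent
    urllib.request.urlopen(url, timeout=10)
time.sleep(10)  # rate limit: pause before the next request to this host

META exclusion tags, such as <meta name="robots" content="noindex">, live inside each page's HTML, so a crawler can only check them after the page has been downloaded.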
Q: What services does Turnitin.com offer?
Turnitin.com offers various services to the educational community. Most prominently, we provide a widely used and effective plagiarism detection service. We also provide a Peer Review service and a series of class management tools. To learn more about our service, visit www.turnitin.com.
Q: Why does Turnitin.com need to crawl my site?
Part of the plagiarism prevention service relies on comparing student papers to content found on the Internet. Since we do not know ahead of time which pages on the Internet a student might use, we need to gather them all for comparison. However, we do have automated ways of discarding content and links that would be irrelevant to our service.
Q: How can I prevent TurnitinBot from accessing certain web pages on my site?
The Robots Exclusion Protocol gives web site maintainers a way to tell a crawler which parts of their site it may not access. It also allows the administrator to create access rules on a crawler-by-crawler basis.

It works something like this: TurnitinBot visits a web site, say http://www.somewhere.com. If it has not visited the site before, or not recently, it first tries to download http://www.somewhere.com/robots.txt. It then examines the robots.txt file for any rules that apply to it. An example of a robots.txt file is:

#This is an example robots.txt file
User-agent: *
Disallow: /secret/
Disallow: /hide/

Lines starting with # are comments and are ignored by the crawler. The User-agent line is used to indicate which crawler(s) should abide by the rules. In this case, a * means all crawlers. If it were

User-agent: turnitinbot

the rules would only apply to the TurnitinBot crawler. Please note that both the "User-agent" token and the "turnitinbot" value are case insensitive; for example, TurnitinBot and TURNITINBOT are equally effective.

The Disallow lines are used to exclude the crawler from particular pages on the site. In this case, any URL whose path starts with /secret/ or /hide/ will be excluded. For instance, http://www.somewhere.com/secret/world.html would be excluded, but http://www.somewhere.com/secret.html would not be.

Note: you may see the Turnitin crawler use the user-agent "Turnitin" rather than "TurnitinBot"; these are equivalent, and the Turnitin crawler will respect robots.txt exclusions for both "turnitinbot" and "turnitin".
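You can verify this prefix-matching behaviour with Python's standard urllib.robotparser module. The following sketch uses the example rules and the hypothetical URLs from this answer:

import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /secret/",
    "Disallow: /hide/",
]
rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Paths starting with /secret/ are excluded...
print(rp.can_fetch("TurnitinBot", "http://www.somewhere.com/secret/world.html"))  # False
# ...but /secret.html does not match the /secret/ prefix, so it is allowed.
print(rp.can_fetch("TurnitinBot", "http://www.somewhere.com/secret.html"))        # True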

For a more thorough explanation please visit https://www.robotstxt.org/robotstxt.html.

Q: How can I completely exclude TurnitinBot from my site?
To exclude TurnitinBot from all or portions of your site, all you have to do is create a file called robots.txt and put it in the top-level directory of your web site.

Below are two example robots.txt files that exclude ONLY our robot: the first from a portion of your site, the second from your entire site.

#Example robots.txt: exclude TurnitinBot from a portion of your site
User-agent: TurnitinBot
Disallow: /hide/     #Disallows any URL starting with /hide/

#Example robots.txt: exclude TurnitinBot from your entire site
User-agent: TurnitinBot
Disallow: /          #Disallows all URLs on your site

Q: What IP addresses does TurnitinBot come from?
Turnitin uses a number of different crawlers and content indexing systems, all of which share the agent name "TurnitinBot" and originate from a set of static IP addresses, any of which our system might assign to a crawler/indexer at a given time. The two main use cases are listed below.

Content Partner organisations/Crossref Members:
If you are a Crossref member using the Crossref Similarity Check (powered by iThenticate) service, then as per the Terms of the Service you are obliged to make at least 90% of your DOI-assigned content available to Turnitin for indexing in our database. This protects your content from potential plagiarism and allows all users of the Turnitin and iThenticate services to compare against it. To facilitate this, you must deposit 'as_crawled' (full-text) URLs in your Crossref metadata and allow requests from the following Turnitin IP ranges to access those URLs:
199.47.87.132 to 199.47.87.135; 199.47.82.0 to 199.47.82.15

Similarly, Content Partner organisations that provide their metadata (including full-text URLs) to Turnitin over FTP should grant access to these IP addresses and to the agent names "Turnitin" and "TurnitinBot", so that Turnitin can crawl and index this content.

General Webcrawl:
Turnitin also crawls publicly accessible content on the Internet from news sites, blogs, academic websites, Open-Access repositories, etc., since these sites could be the target of students/authors wanting to plagiarise content, and it is in the interests of both content owners and users of our service for us to be able to compare against content hosted on these sites. In this instance, you may see our general web crawler accessing your website from the following IP addresses:
199.47.80.0 to 199.47.87.255; 199.47.82.16 to 199.47.82.31

Please note that Crossref Members/Content Partners may also see activity from Turnitin's general web crawler on their websites. If this causes any inconvenience or slows down traffic, it can be handled via robots.txt restrictions or by blocking just the general-crawl addresses (199.47.80.0 to 199.47.87.131 and 199.47.87.136 to 199.47.87.255).
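If you want to check programmatically whether a visiting IP address falls within the ranges quoted above, the following Python sketch is one way to do it. The in_range helper and the sample address are made up for this example; the ranges are the ones listed in this answer.

import ipaddress

# True if ip lies between start and end, inclusive.
def in_range(ip, start, end):
    addr = int(ipaddress.ip_address(ip))
    return int(ipaddress.ip_address(start)) <= addr <= int(ipaddress.ip_address(end))

# The general-webcrawl addresses minus the Content Partner addresses,
# exactly as quoted in the paragraph above.
general_only = [("199.47.80.0", "199.47.87.131"),
                ("199.47.87.136", "199.47.87.255")]
print(any(in_range("199.47.84.10", s, e) for s, e in general_only))  # True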


Q: Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?
There are two likely explanations. The first is that your site or another site contains an incorrect link to your site, i.e. a link to a page that does not exist; not knowing any better, we followed this link, generating a 404 error on your server. The other possibility is that TurnitinBot improperly parsed a link from a page.
Q: How can I contact you to report a problem?
If you still have questions or want to speak with us about our crawler's behavior, you can contact us at crawler@turnitin.com. Providing the following information will help us determine what occurred when our crawler visited your site:

* A description of your question or problem.
* The IP address of the server which our crawler visited.
* The approximate time and date of the visit.
* A means to contact you if not by email.
* Entries from your server log(s) that pertain to our visit. In particular, the URLs we visited that triggered the problem.