
How to bring deeper meaning to the Similarity Score

Turnitin Originality

Gretchen Hanson








There is a lot of confusion around Turnitin’s similarity score. What’s the right percentage? Is this percentage too high? Is that one too low? Ultimately, that decision is up to each instructor. In this post, we want to offer some insights to help instructors make informed decisions about the similarity score. It helps to understand how the Similarity Report gets created, so we start there and then share some tips for moving beyond the percentage and toward evaluating the rich information in the report.

How Does Turnitin Find Text Similarity?

When a student submits their document, things start to get exciting, programmatically speaking. In a matter of milliseconds, Turnitin does some pretty amazing things with the document.

First, all of the words are broken into phrases, and common words (such as “and,” “or,” “the,” etc.) are removed. Each phrase is stored with its own unique “fingerprint” ID. Then we compare these phrase IDs to our content databases to determine whether there are any matches.
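To make the idea concrete, here is a minimal sketch of phrase fingerprinting in Python. It is an illustration only, not Turnitin's actual implementation: the stop-word list, the three-word phrase length, the SHA-1 hash, and the function names `fingerprints` and `matching_ids` are all assumptions, and the sketch drops common words before forming phrases to keep the code short.

```python
import hashlib
import re

# Assumed values: the post does not say which common words are removed
# or how long a phrase is.
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "to", "in"}
PHRASE_LEN = 3


def fingerprints(text, phrase_len=PHRASE_LEN):
    """Return the set of phrase IDs (short hashes) for a text."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower())
             if w not in STOP_WORDS]
    ids = set()
    # Slide a window of phrase_len words across the remaining words.
    for i in range(len(words) - phrase_len + 1):
        phrase = " ".join(words[i:i + phrase_len])
        ids.add(hashlib.sha1(phrase.encode("utf-8")).hexdigest()[:12])
    return ids


def matching_ids(submission, database_docs):
    """Map each database document name to the phrase IDs it shares
    with the submission. Documents with no shared IDs are omitted."""
    sub_ids = fingerprints(submission)
    result = {}
    for name, text in database_docs.items():
        shared = sub_ids & fingerprints(text)
        if shared:
            result[name] = shared
    return result


if __name__ == "__main__":
    database = {
        "source-a": "A quick brown fox jumps",
        "source-b": "Totally unrelated words here now",
    }
    hits = matching_ids("The quick brown fox jumps over the lazy dog", database)
    print(sorted(hits))  # ['source-a']
```

Comparing sets of short IDs rather than raw text is what lets a lookup like this scale: each phrase becomes a fixed-size key that can be checked against an index.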

Our content databases include more than 1.2 billion student submissions, 70 billion current and archived web pages, and 180 million articles from the top academic journals and sources. The breadth and depth of Turnitin’s content continue to grow by millions every day.

Every submitted document can contain up to 80,000 phrase IDs; each phrase ID is compared to 7 trillion possible phrase matches coming from the content databases. If the Turnitin software spots potential matches, it then applies natural language processing and strict matching heuristics to limit the number of false positives and generate the most accurate report. A few other fancy things happen in tandem to improve the results, such as looking for hidden text or replaced characters, but let’s keep it simple for now.

Within 10 seconds, all of the above happens, resulting in a Similarity Report.

(Interesting note: Turnitin generates about 20 reports per second. And on the busiest days, it can receive more than 1 million submissions!)

How to Evaluate the Similarity Report

Ultimately, the Similarity Report gives the instructor information about all of the sources where Turnitin found matching phrases or text. Each match is highlighted and associated with the most relevant source. As you might imagine, with such a massive database of content being compared, any given match may have multiple sources. So for each match, we determine which source is the most important or significant and cite that as the primary source. For this reason, even if an instructor chooses to exclude that source, another source may still contain the match, so the overall similarity score may not change.
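A small sketch can show why excluding a source does not always move the score. The data model below is hypothetical: each match is a dictionary with a word count and a list of candidate sources already ranked from most to least significant, and `primary_sources` and `similarity_score` are illustrative names, not part of any Turnitin API.

```python
def primary_sources(matches, excluded=frozenset()):
    """For each match, return its highest-ranked source that has not been
    excluded, or None if every candidate source is excluded."""
    result = []
    for match in matches:
        remaining = [s for s in match["sources"] if s not in excluded]
        result.append(remaining[0] if remaining else None)
    return result


def similarity_score(matches, total_words, excluded=frozenset()):
    """Percentage of the document's words covered by matches that still
    have at least one non-excluded source."""
    primaries = primary_sources(matches, excluded)
    matched = sum(m["words"] for m, p in zip(matches, primaries)
                  if p is not None)
    return round(100 * matched / total_words)


if __name__ == "__main__":
    # Assumed example: a 200-word document with two matches.
    matches = [
        {"words": 40, "sources": ["journal-x", "website-y"]},
        {"words": 10, "sources": ["website-z"]},
    ]
    print(similarity_score(matches, 200))                          # 25
    # The first match falls back to website-y, so the score holds.
    print(similarity_score(matches, 200, excluded={"journal-x"}))  # 25
    # The second match has no other source, so the score drops.
    print(similarity_score(matches, 200, excluded={"website-z"}))  # 20
```

In this sketch the score only changes when a match loses its last remaining source, which is the behavior described above.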

So, how do we recommend evaluating matches within the Similarity Report? First, be aware that there are options to refine the matches you see. You can choose to exclude quotations, bibliographies, small matches under a certain number of words, individual sources, or even whole databases. For example, if you do not want to compare against other student submissions, you can exclude the submitted works repository.
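These options behave like filters over the list of matches. The sketch below models that idea in Python under stated assumptions: the `Match` fields, the `kind` and `repository` labels, and the `apply_exclusions` function are all hypothetical and do not reflect Turnitin's real settings or API.

```python
from dataclasses import dataclass


@dataclass
class Match:
    words: int        # number of matching words
    kind: str         # assumed labels: "body", "quotation", "bibliography"
    source: str       # name of the matched source
    repository: str   # assumed labels: "web", "journal", "student"


def apply_exclusions(matches, exclude_quotes=False, exclude_bibliography=False,
                     min_words=0, excluded_sources=(), excluded_repositories=()):
    """Return only the matches that survive the chosen exclusions."""
    kept = []
    for m in matches:
        if exclude_quotes and m.kind == "quotation":
            continue
        if exclude_bibliography and m.kind == "bibliography":
            continue
        if m.words < min_words:  # small match under the word threshold
            continue
        if m.source in excluded_sources:
            continue
        if m.repository in excluded_repositories:
            continue
        kept.append(m)
    return kept


if __name__ == "__main__":
    matches = [
        Match(30, "body", "site-a", "web"),
        Match(12, "quotation", "journal-b", "journal"),
        Match(4, "body", "site-c", "web"),
        Match(25, "body", "paper-d", "student"),
    ]
    # Skip quotations, matches under 5 words, and other student submissions.
    kept = apply_exclusions(matches, exclude_quotes=True, min_words=5,
                            excluded_repositories={"student"})
    print([m.source for m in kept])  # ['site-a']
```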

Instructors can also choose to exclude a specific text match and let Turnitin know why it wasn’t a valuable match. We will use this data to continue to train our matching algorithm to make it smarter and produce even better similarity results.

All of these exclusions can be applied before the Similarity Report is run, as well as while you’re using the report. If you’d like to save your exclusions, you can print your report or save it as a PDF, and the dynamic changes will be captured.

Using these exclusions can help you narrow in on the most important matches. They can help you identify where there might simply have been errors in citation, so you can provide that guidance and feedback to the student. Depending on what you want to evaluate, the Similarity Report can show you where to focus.

The similarity score is simply a reflection of the percentage of similar words. The power of the Similarity Report, however, is in helping educators identify issues, focus on areas of excellence or growth, and guide feedback that helps students improve their writing and keep integrity at the core of all they do.

Watch this quick video to learn more about the Similarity Report and its exclusion options. If you want additional help, these guides for students and instructors on how to interpret the Similarity Report are a great resource.