Plagiarism is copying a source without proper acknowledgement or authorization. Some degree of similarity is acceptable: it can be a coincidence, an involuntary emulation, or an accepted practice. But a substantial degree of similarity is not acceptable and constitutes an intellectual property violation.
There is no exact definition of when a work amounts to plagiarism, and in these times of the Internet and easy cut-and-paste we need to be more precise.
Are 4 words in a row plagiarism? If I say “I have a dream”, you probably think of Martin Luther King, but there will be no serious accusation. Thus, 4 words are not enough for plagiarism. How many are needed, then? How about 20?
Another issue is “proper acknowledgement”. If I do not remember the source, I can say “Someone said”. Is that enough? Some authors will demand a complete citation even for partial use of a paper.
There are a number of plagiarism detection products, such as Copyscape.com and Turnitin.com, but they do not specify when there is plagiarism. They just rank possible copies by similarity. The problem is that when a source is modified, the boundary between a Derivative Work and a Different Work is blurred.
An academic paper by (Chen) describes a method to compare and measure similarity between sequences, whether they are software, genes or text. However, it does not address the Internet plagiarism issue. Work by (Ceska) and (Mozgovoy) investigates different preprocessing methods and algorithms to improve plagiarism detection, but does not propose a simple plagiarism index.
So far, no valid definition of plagiarism exists other than the subjective judgment of a judge. It is accepted that the more words differ from the original, the lower the chance of being accused of plagiarism. There is also less chance of triggering detection tools.
It would be possible to come up with a formula that defines the limit of plagiarism. It could use these variables:
Total number of words in the supposed derivative: d
Number of identical phrases (words in exact sequence), by length:
Number of identical 6-word phrases: I6
Number of identical 7-word phrases: I7
Number of identical 8-word phrases: I8 …
Number of identical n-word phrases: In
KI6 = coefficient of seriousness for a 6-word phrase infringement. Suggested value: 10
KI7 = coefficient of seriousness for a 7-word phrase infringement. Suggested value: 20 …
Plagiarism index: ( I6 x KI6 + I7 x KI7 + I8 x KI8 + … + In x KIn ) / d
Explanation: the longer the matching phrase, the more serious the infringement. The longer the derivative work, the less important the copied phrases are. Replacing words with synonyms in a copied text will create several short phrase infringements instead of one long one.
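As a rough illustration of the arithmetic, here is a minimal Python sketch of the index. The function name and the sample counts (three 6-word phrases and one 7-word phrase in a 1000-word derivative) are made up for this example; only the coefficients 10 and 20 come from the suggested values above.

```python
def plagiarism_index(phrase_counts, coefficients, total_words):
    """Weighted sum of identical-phrase counts divided by d.

    phrase_counts: {phrase length n: number of identical n-word phrases (In)}
    coefficients:  {phrase length n: coefficient of seriousness (KIn)}
    total_words:   total number of words in the supposed derivative (d)
    """
    if total_words <= 0:
        raise ValueError("total_words (d) must be positive")
    weighted = sum(count * coefficients[n] for n, count in phrase_counts.items())
    return weighted / total_words


# Suggested coefficients from the text; values for longer phrases are not given.
K = {6: 10, 7: 20}

# Hypothetical counts, invented for this example only.
counts = {6: 3, 7: 1}

index = plagiarism_index(counts, K, total_words=1000)
print(index)  # (3 x 10 + 1 x 20) / 1000
```

With the same copied phrases in a 100-word derivative, the result would be ten times higher, which is the effect of dividing by d.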
Plagiarism detection: using this formula, it would be easy to create software that compares two writings and establishes the degree of similarity, providing the “Plagiarism Index” and triggering an alarm when the index goes over a pre-established limit.
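Such a tool could be sketched as follows. This is only one possible reading of the idea, with several assumptions: matching runs are found with Python's standard difflib.SequenceMatcher, each run of identical words is counted once at its full length, the coefficient for phrases longer than 7 words is a linear guess (the text only suggests 10 and 20), and the limit of 0.5 is an arbitrary placeholder.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

MIN_PHRASE = 6  # the proposal starts counting at 6-word phrases


def words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_counts(source, derivative):
    """Count matching runs of identical words by length; also return d."""
    a, b = words(source), words(derivative)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    counts = Counter()
    for block in matcher.get_matching_blocks():
        if block.size >= MIN_PHRASE:
            counts[block.size] += 1
    return counts, len(b)


def hypothetical_coefficient(n):
    # ASSUMPTION: 10 for 6 words and 20 for 7 words as suggested in the text;
    # the linear continuation for longer phrases is a guess, not a given value.
    return 10 * (n - 5)


def check(source, derivative, limit, coefficient=hypothetical_coefficient):
    """Return (plagiarism index, True if the index is over the limit)."""
    counts, d = phrase_counts(source, derivative)
    if d == 0:
        return 0.0, False
    index = sum(c * coefficient(n) for n, c in counts.items()) / d
    return index, index > limit


source = "the quick brown fox jumps over the lazy dog near the old river bank"
derivative = "yesterday the quick brown fox jumps over a sleepy cat"

# ASSUMPTION: 0.5 is a placeholder limit; accepted values are not established.
index, flagged = check(source, derivative, limit=0.5)
print(index, flagged)
```

Here the derivative shares one 6-word run with the source and has 10 words in total, so the index is 1 x 10 / 10 and the alarm is triggered. SequenceMatcher only reports matching runs that appear in the same order in both texts, so a real tool would probably need a more thorough search.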
The exact formula and the accepted values of the Plagiarism Index are open to discussion.
Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, Amit Seker. Shared Information and Program Plagiarism Detection. University of California, Santa Barbara. http://bioinformatics.uwaterloo.ca/papers/04sid.pdf
Zdenek Ceska, Chris Fox. The Influence of Text Pre-processing on Plagiarism Detection. http://cswww.essex.ac.uk/staff/foxcj/papers/C-Fox-RANLP2009-paper.pdf
Maxim Mozgovoy. Enhancing Computer-Aided Plagiarism Detection. http://joypub.joensuu.fi/publications/dissertations/mozgovoy_plagiarism/mozgovoy.pdf