Implement an anti-plagiarism checker with Redis

Implement an anti-plagiarism software using Redis as a vector database

I have written about the different problems that can be solved using the powerful features of Redis as a vector database.

In this article, I want to present a simple proof of concept of anti-plagiarism software. Let’s say that you want to verify whether specific documents or images have been copied from your website (and you would like to protect your intellectual property, and be paid or at least acknowledged). You can easily index your data using Redis and Vector Similarity Search (VSS).

You can find the code of OAPS, the Open Anti-Plagiarism Software, and use it to develop and test your implementation.

Plagiarism of documents

We start our example by creating an index of vectors as follows:

FT.CREATE oaps_txt_idx 
ON JSON 
PREFIX 1 oaps:seq: 
SCHEMA $.sentence AS sentence TEXT 
$.embedding AS embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE

In the sample code in the repo you will learn how to create an index in Python.
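
As a rough illustration (this is not the repo's own code), the FT.CREATE command above can also be sent from Python by passing the same tokens to a client's generic execute_command method, which redis-py provides. The helper names build_create_args and create_index are hypothetical, and conn is assumed to be a connection to a Redis server with the search and JSON capabilities available.

```python
def build_create_args(index="oaps_txt_idx", prefix="oaps:seq:", dim=384):
    """Return the FT.CREATE tokens for the sentence index shown above."""
    return [
        "FT.CREATE", index,
        "ON", "JSON",
        "PREFIX", "1", prefix,
        "SCHEMA",
        "$.sentence", "AS", "sentence", "TEXT",
        # "6" is the number of tokens that follow it in the vector definition
        "$.embedding", "AS", "embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]


def create_index(conn, **kwargs):
    """Send FT.CREATE through any client exposing execute_command()."""
    return conn.execute_command(*build_create_args(**kwargs))
```

With redis-py this would be called as create_index(redis.Redis()). Redis replies with an error if an index with the same name already exists.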

Machine learning models produce embeddings of texts of limited size. As an example, the model all-MiniLM-L12-v1 truncates input text longer than 128 word pieces (tokens, not whole words). It therefore makes sense to split a document into parts, which also makes the verification of documents more precise. Here I have decided to split documents into sentences (but you could test paragraphs). Once the sentences are identified, they pass through the model and the vectors are produced. Here is a Python snippet that accomplishes the purpose:

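from sentence_transformers import SentenceTransformer

# Load the pre-trained sentence embedding model (384-dimensional output)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1')

# text holds a single sentence; encode() returns a NumPy array,
# converted here to a plain list so it can be stored as JSON
vector = model.encode(text).tolist()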
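
The snippet above expects text to hold a single sentence. A minimal sketch of the splitting step, assuming a naive rule based on punctuation, could look like the following. The function name split_sentences is hypothetical, and a dedicated sentence tokenizer would handle abbreviations such as "e.g." better than this regular expression does.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(document):
    """Naively split a document into sentences, dropping empty pieces."""
    parts = _SENTENCE_END.split(document.strip())
    return [p.strip() for p in parts if p.strip()]


# Number the sentences so that each one can be stored with its position
sample = "Redis is fast. Is it a vector database? Yes!"
for seq, txt_sentence in enumerate(split_sentences(sample)):
    print(seq, txt_sentence)
```

Each (seq, txt_sentence) pair can then be passed through the model, one sentence at a time.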

Now that the vector has been produced, it is time to store it in Redis. Redis can store vectors in Hash or JSON data structures. For this example, I have chosen the JSON data structure, and I store the sequence number of the sentence within the document, the sentence itself, and the vector. Note: I am using the redis-py client library.

        
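import redis

# Assumed connection to a local Redis instance; adjust host and port
conn = redis.Redis(host='localhost', port=6379, decode_responses=True)

# One JSON document per sentence
sentence = {
    'seq': seq,
    'sentence': txt_sentence,
    'embedding': vector
}

# Key layout oaps:seq:<pk>:<seq> matches the index prefix oaps:seq:
conn.json().set("oaps:seq:{}:{}".format(pk, seq), '$', sentence)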

Once the documents are indexed, the database is ready for testing. Following the same approach, we split the test documents (the documents that must be verified) into sentences, produce a vector per sentence, and test the similarity of each sentence against the sentences stored in the database. Note that we can configure the tolerance to restrict the results (using an epsilon coefficient).
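
One way to run such a test is a vector range query, which returns only the stored sentences whose cosine distance from the test sentence falls within a given radius. The sketch below is an assumption about how this could be done, not the repo's implementation: the helper names, the radius of 0.2 and the score alias are all hypothetical, and the radius stands in for the tolerance here. The query vector is passed as a binary FLOAT32 blob, packed little-endian, which matches what NumPy's float32 tobytes() produces on common platforms.

```python
import struct


def to_float32_blob(vector):
    """Pack a list of floats as little-endian FLOAT32 bytes."""
    return struct.pack("<%df" % len(vector), *vector)


def build_range_search_args(vector, radius=0.2, index="oaps_txt_idx"):
    """Return FT.SEARCH tokens for a vector range query around `vector`."""
    # Keep sentences within `radius` and expose the distance as `score`
    query = "@embedding:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: score}"
    return [
        "FT.SEARCH", index, query,
        "RETURN", "2", "sentence", "score",
        "SORTBY", "score",
        "PARAMS", "4", "radius", str(radius), "vec", to_float32_blob(vector),
        "DIALECT", "2",
    ]


def find_similar(conn, vector, radius=0.2):
    """Run the range query through any client exposing execute_command()."""
    return conn.execute_command(*build_range_search_args(vector, radius))
```

With the COSINE metric, a distance of 0 means the two vectors point in the same direction, so a smaller radius makes the check stricter. A sentence that returns at least one match within the radius is a candidate for closer inspection.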

Plagiarism of images

Extending this example to images is straightforward. Using a suitable model to extract a vector embedding from each image, we model our database of images as a database of vectors, and perform a vector similarity search to find out whether an image is being used without consent.

Wrapping up

You can easily index unstructured data using the many freely available embedding models to protect intellectual property. Audio, image, and text files can be vectorized and indexed by Redis using the desired indexing method (FLAT or HNSW). Remember to clone the repository and give the demo a try.
