Sunday, February 7, 2016

Simple Duplicate Content Checker

This program detects duplicate content by comparing two articles sentence by sentence. Every sentence from the first article that also appears in the second is highlighted in the second text area.



import javax.swing.*;
import javax.swing.event.*;
import javax.swing.text.*;
import javax.swing.text.Highlighter.*;
import java.awt.*;
import java.awt.event.*;
import java.util.*;

public class DuplicateDetector extends JFrame{
    JTextArea tarea1=null;
    JScrollPane sp1=null;
    JTextArea tarea2=null;
    JScrollPane sp2=null;
    JButton button=null;
    
    public DuplicateDetector(){
        super("Duplicate Content Detector");
        setSize(800,600);
        
        tarea1=new JTextArea();
        tarea1.setLineWrap(true);
        sp1=new JScrollPane(tarea1);
        sp1.setPreferredSize(new Dimension(600, 200) );
        tarea2=new JTextArea();
        tarea2.setLineWrap(true);
        sp2=new JScrollPane(tarea2);
        sp2.setPreferredSize(new Dimension(600, 200) );
        button=new JButton("Check");
        button.addActionListener(new ButtonHandler());
        
        setLayout(new GridBagLayout());
        GridBagConstraints gbc = new GridBagConstraints();
        gbc.gridx = 0;
        gbc.gridy = 0;
        
        add(sp1,gbc);
        gbc.gridy++;
        add(sp2,gbc);
        gbc.gridy++;
        add(button,gbc);
        
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        setVisible(true); // show() is deprecated
    }
    
    class ButtonHandler implements ActionListener{
        public void actionPerformed(ActionEvent e){
            String st1=tarea1.getText();
            String st2=tarea2.getText();
            
            ArrayList<String> arrlist=new ArrayList<String>();
            StringTokenizer stoken=new StringTokenizer(st1,".");
            int counter=0;
            while(stoken.hasMoreTokens()){
                arrlist.add(counter,stoken.nextToken().trim());
                counter++;
            }
            
            for(int i=0;i<arrlist.size();i++){
                String stemp=arrlist.get(i);
                if(st2.contains(stemp)){
                    int a=st2.indexOf(stemp);
                    int b=a+stemp.length();
                    System.out.println(a+" "+b);
                    Highlighter highlighter = tarea2.getHighlighter();
                    HighlightPainter painter = new DefaultHighlighter.DefaultHighlightPainter(Color.pink);
                    try{
                        highlighter.addHighlight(a,b,painter);
                    }catch(BadLocationException ex){
                        // Report invalid offsets instead of silently ignoring them
                        ex.printStackTrace();
                    }
                }
            }
        }
    }
    
    public static void main(String[] args){
        // Build the UI on the event dispatch thread, as Swing expects
        SwingUtilities.invokeLater(() -> new DuplicateDetector());
    }
}


JTextArea tarea1=null;
JScrollPane sp1=null;
JTextArea tarea2=null;
JScrollPane sp2=null;
JButton button=null;

The two text areas hold the articles that you want to compare. Each one is wrapped in a JScrollPane so that long articles can be scrolled. When the button is clicked, sentences from the first article that also appear in the second are highlighted in the second text area.

ArrayList<String> arrlist=new ArrayList<String>();

Duplicate detection works on a sentence-by-sentence basis. The ArrayList stores the sentences extracted from the article in the first JTextArea.

StringTokenizer stoken=new StringTokenizer(st1,".");
int counter=0;
while(stoken.hasMoreTokens()){
     arrlist.add(counter,stoken.nextToken().trim());
     counter++;
}

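The article is split into sentences with a StringTokenizer that uses "." (the period) as its delimiter, since a period normally marks the end of a sentence. Each token is trimmed of surrounding whitespace and appended to the ArrayList. Sentences that end with "?" or "!" are not split off, and a period inside an abbreviation or a number is treated as a sentence end.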
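The extraction step can also be written as a small standalone helper. This is only a sketch under two assumptions that go beyond the original program: "?" and "!" are added as delimiters, and pieces that are empty after trimming are dropped (an empty string is contained in every string, so it would always count as a match). The names SentenceSplitter and split() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of the original program.
final class SentenceSplitter {

    // Splits text on '.', '?' and '!' and drops pieces that are empty after trimming.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text, ".?!");
        while (tokens.hasMoreTokens()) {
            String sentence = tokens.nextToken().trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // prints [Java is fun, Is Swing old, Yes]
        System.out.println(split("Java is fun. Is Swing old? Yes! "));
    }
}
```

The returned list can be used in place of "arrlist" in the loop that follows.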

for(int i=0;i<arrlist.size();i++){
     String stemp=arrlist.get(i);
     if(st2.contains(stemp)){
          int a=st2.indexOf(stemp);
          int b=a+stemp.length();
          Highlighter highlighter = tarea2.getHighlighter();
          HighlightPainter painter = new DefaultHighlighter.DefaultHighlightPainter(Color.pink);
          try{
               highlighter.addHighlight(a,b,painter);
          }catch(BadLocationException ex){
               ex.printStackTrace();
          }
     }
}

The duplicate content detection is performed here. On each pass through the loop, the variable "stemp" holds one sentence from the ArrayList, retrieved with the get() method.

if(st2.contains(stemp)){
..
}

This checks whether the sentence stored in the "stemp" variable also occurs in the second article. The comparison is exact, so a difference in case or spacing prevents a match.

int a=st2.indexOf(stemp);
int b=a+stemp.length();

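The variable "a" holds the index of the first character of the matching sentence in the second article, and "b" holds the index just past its last character. Because indexOf() returns only the first occurrence, a sentence that is repeated in the second article is highlighted once.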
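To mark every occurrence of a sentence rather than a single one, the search can be repeated starting from the end of the previous match. The sketch below is one way to do that and is not part of the original program; the names MatchFinder and findAll() are hypothetical. Each returned pair corresponds to the "a" and "b" values above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original program.
final class MatchFinder {

    // Returns {start, end} pairs for every non-overlapping occurrence of
    // sentence in text; end is the index just past the last matched character.
    static List<int[]> findAll(String text, String sentence) {
        List<int[]> ranges = new ArrayList<>();
        if (sentence.isEmpty()) {
            return ranges; // an empty sentence would match at every position
        }
        int start = text.indexOf(sentence);
        while (start >= 0) {
            int end = start + sentence.length();
            ranges.add(new int[] {start, end});
            // Continue searching after the match that was just recorded
            start = text.indexOf(sentence, end);
        }
        return ranges;
    }
}
```

For example, findAll("abc. xyz. abc.", "abc") yields the ranges 0 to 3 and 10 to 13.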

Highlighter highlighter = tarea2.getHighlighter();
HighlightPainter painter = new DefaultHighlighter.DefaultHighlightPainter(Color.pink);
try{
     highlighter.addHighlight(a,b,painter);
}catch(BadLocationException ex){
     ex.printStackTrace();
}

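This block highlights the duplicate content that was detected. Highlighter is an interface for marking ranges of text in a text component; here DefaultHighlightPainter fills the background of the range with pink. The "a" and "b" variables described above are passed to addHighlight() as the start and end offsets of the range.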
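One thing the program does not do is remove highlights from an earlier check, so marks accumulate each time the button is clicked. A minimal sketch of the highlighting step with a separate clearing step might look like this. HighlightSketch, clear() and highlight() are hypothetical names; the Swing calls themselves (getHighlighter(), removeAllHighlights(), addHighlight()) are the standard ones.

```java
import java.awt.Color;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.Highlighter;
import javax.swing.text.JTextComponent;

// Hypothetical helper, not part of the original program.
final class HighlightSketch {

    // Removes every highlight currently set on the component.
    static void clear(JTextComponent target) {
        target.getHighlighter().removeAllHighlights();
    }

    // Paints the range [start, end) with a pink background.
    // Returns false if Swing rejects the offsets.
    static boolean highlight(JTextComponent target, int start, int end) {
        Highlighter highlighter = target.getHighlighter();
        Highlighter.HighlightPainter painter =
                new DefaultHighlighter.DefaultHighlightPainter(Color.PINK);
        try {
            highlighter.addHighlight(start, end, painter);
            return true;
        } catch (BadLocationException ex) {
            ex.printStackTrace();
            return false;
        }
    }
}
```

In the button handler, clear(tarea2) would be called once before the loop, and highlight(tarea2, a, b) would replace the try/catch block.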
