Thursday, February 11, 2016

Extracting URLs From A Webpage


This program demonstrates the use of Jsoup, a third-party library, to fetch the hyperlinks from a webpage. It also displays the number of external links contained on the page. For this purpose, the jsoup.jar file must be downloaded and added to the project.
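
For instance, assuming the downloaded jar is named jsoup-1.8.3.jar (the exact file name depends on the version you download), compiling and running from the command line could look roughly like this on Linux/macOS (replace ':' with ';' on Windows):

javac -cp .:jsoup-1.8.3.jar LinkGrabber.java
java -cp .:jsoup-1.8.3.jar LinkGrabber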

Here is the code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Element;
import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.net.*;

public class LinkGrabber extends JFrame{
    JLabel l;
    JTextField tf;
    JTextArea ta;
    JScrollPane sp;
    JTextArea ta2;
    JButton b;
    ArrayList<String> al;
    
    public static void main(String[] args){
        LinkGrabber ap=new LinkGrabber();
    }
    
    public LinkGrabber(){
        super("Link Grabber");
        setSize(800,400);
        
        l=new JLabel("URL:");
        ta=new JTextArea();
        sp = new JScrollPane(ta);
        sp.setPreferredSize(new Dimension(600, 200));
        ta2=new JTextArea();
        ta2.setPreferredSize(new Dimension(600, 50));
        ta2.setEditable(false);
        tf=new JTextField(50);
        b=new JButton("Show");
        b.addActionListener(new LinkGrabber.ButtonHandler());
        al=new ArrayList<String>();
        
        setLayout(new GridBagLayout());
        GridBagConstraints gbc = new GridBagConstraints();
        gbc.insets = new Insets(10,0,0,0);
        gbc.gridx=0;
        gbc.gridy=0;
        add(l,gbc);
        gbc.gridx=1;
        add(tf,gbc);
        gbc.gridy++;
        add(sp,gbc);
        gbc.gridy++;
        add(ta2,gbc);
        gbc.gridy++;
        add(b,gbc);
        
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        setVisible(true);
    }
    
    class ButtonHandler implements ActionListener{
        public void actionPerformed(ActionEvent e){
            String st=e.getActionCommand();
            if(st.equals("Show")){
                try{
                    // Fetch and parse the page at the URL the user typed
                    Document doc=Jsoup.connect(tf.getText()).get();
                    
                    int counter=0;
                    int linksout=0;
                    ta.setText("");
                    Elements links=doc.select("a");
                    for (Element el : links) {
                        al.add(el.attr("href")+"\n");
                        ta.append(al.get(counter));
                        String dom=al.get(counter);
                        if(dom.contains("http://")||dom.contains("https://")){
                            URL link=new URL(tf.getText());
                            String hostname=link.getHost();
                            if(!dom.contains(hostname)){
                                linksout++;
                            }
                        }
                        counter++;
                    }
                    ta2.setText("");
                    ta2.setText("External Links:"+linksout);
                }catch(Exception ex){
                    // Report the problem instead of failing silently
                    ta2.setText("Error: "+ex.getMessage());
                }
            }
        }
    }    
}

JTextField tf;
JTextArea ta;
JTextArea ta2;
JButton b;
ArrayList<String> al;

The JTextField is used for user input: the URL of the webpage whose links will be extracted is typed here. The first text area, "ta", shows the extracted URLs, and the second text area, "ta2", displays the number of outbound/external links on the webpage. Clicking the button "b" starts the process.

Document doc=Jsoup.connect(tf.getText()).get();

Basically, this line of code fetches an HTML document from the web and parses it into a Jsoup Document object.
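
As a side note, Jsoup's fluent Connection API also allows setting a user agent and a timeout before the request is made. A small sketch (the user-agent string and timeout value are arbitrary example choices, not part of the program above):

Document doc = Jsoup.connect(tf.getText())
        .userAgent("Mozilla/5.0")   // some servers reject requests with no user agent
        .timeout(10 * 1000)         // give up after 10 seconds
        .get();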

int counter=0;

This variable keeps a running count of the links processed so far and also serves as the index into the array list; after the loop, it equals the total number of URLs (internal and external) on the webpage.

int linksout=0;

This variable holds the number of external (outbound) URLs found on the webpage.

Elements links=doc.select("a");
for (Element el : links) {
.
.
.
}

Simply put, this loop iterates over all the anchor ("a") elements, i.e. the links, on the webpage.
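
The argument to select() is a CSS-style selector. As a minor aside (an alternative, not what the code above does), selecting only anchors that actually carry an href attribute would skip bare <a> tags:

Elements links = doc.select("a[href]");  // anchors without an href are ignored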

al.add(el.attr("href")+"\n");

The href attribute of each link is added, one by one, to an array list.
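
Note that attr("href") returns the attribute exactly as it is written in the page, so relative links (such as "/about") stay relative. If absolute URLs were wanted instead, Jsoup's absUrl() could resolve them against the page's address; a hypothetical variation:

String absolute = el.absUrl("href");  // resolves relative hrefs against the page's base URI
al.add(absolute + "\n");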

ta.append(al.get(counter));

The URL that has just been retrieved is appended to the text area. At this point, the retrieval of the current link is complete; what follows only classifies it.

String dom=al.get(counter);

The variable "dom" holds the current URL. It will be checked lated to determine whether it is an internal or external link.

if(dom.contains("http://")||dom.contains("https://")){
    URL link=new URL(tf.getText());
    String hostname=link.getHost();
    if(!dom.contains(hostname)){
        linksout++;
    }
}

If the URL contains http:// or https://, there is a chance that it is an external link, since internal links are often written as relative paths that do not include a scheme.

URL link=new URL(tf.getText());
String hostname=link.getHost();

The java.net.URL class is used to extract the hostname of the page the user entered. The result is stored in the String variable "hostname".
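
For example, with a made-up address:

URL link = new URL("https://www.example.com/blog/post.html");  // may throw MalformedURLException
String hostname = link.getHost();  // "www.example.com"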

if(!dom.contains(hostname)){
    linksout++;
}

If the link does not contain that hostname, it is treated as an external link and the "linksout" variable is incremented by 1.
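
Keep in mind that contains() is only a rough heuristic: a link such as http://othersite.com/?ref=example.com contains the page's hostname and would be counted as internal. A stricter (hypothetical) variation could parse the link itself and compare hosts, for example:

// Hypothetical stricter check: compare the link's own host with the page's host
URL linkUrl = new URL(dom.trim());  // trim() removes the "\n" appended earlier
if (!linkUrl.getHost().equalsIgnoreCase(hostname)) {
    linksout++;
}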

ta2.setText("External Links:"+linksout);

The number of outbound links is displayed on the second text area.
