How to Extract Links from a Web Page in Linux

Posted on January 16, 2019 at 3:39 pm

In Linux, you can easily extract the links from a web page in several ways:

Using Lynx

Extract the links from a page (https://example.com is a placeholder; substitute your own URL) and save them to a links.txt file:

lynx -dump https://example.com | awk '/http/{print $2}' > links.txt

If you need to number the lines, pipe the output through the nl command:

lynx -dump https://example.com | awk '/http/{print $2}' | nl > links.txt
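You can check the awk filter without a live page by running it on a saved dump. lynx -dump appends a "References" section where each entry is a number followed by a URL, so $2 of a matching line is the URL itself. A minimal sketch (the sample dump below is invented):

```shell
# Simulate the "References" section that `lynx -dump` appends to its output.
cat > dump.txt <<'EOF'
References

   1. https://example.com/about
   2. https://example.com/contact
   3. mailto:webmaster@example.com
EOF

# Keep only lines containing "http" and print the second field (the URL).
# The mailto: entry is dropped because it does not contain "http".
awk '/http/{print $2}' dump.txt > links.txt
cat links.txt
```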

Using Pup (Golang)

First install Go and pup:

apt-get install golang
go get github.com/ericchiang/pup

Then use this command to extract all links from a file:

pup 'a[href] attr{href}' < yourfile.html

To extract only the links that carry a particular CSS class (here classname) and save them to a file:

pup 'a.classname[href] attr{href}' < yourfile.html > links.txt

Using cURL, tr, grep

This fetches the page (https://example.com is a placeholder URL), splits the HTML on both quote characters so each attribute value lands on its own line, then keeps only the lines that start like a URL:

curl "https://example.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e '^//' | sort | uniq > links.txt
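The tr/grep stage can be exercised without a network fetch by piping in sample HTML. A small sketch (the markup below is invented for illustration):

```shell
# Split on both quote characters so each attribute value lands on its own
# line, then keep only lines that look like URLs, deduplicated.
printf '%s' '<a href="https://example.com/a">x</a> <img src='\''http://example.com/i.png'\''>' \
  | tr '"' '\n' | tr "'" '\n' \
  | grep -e '^https://' -e '^http://' -e '^//' \
  | sort | uniq > links.txt
cat links.txt
```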

Using Wget, grep

Note that the final grep keeps only the scheme and host of each link, dropping the path (https://example.com is a placeholder URL):

wget -qO- https://example.com | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^"]+"' | grep -Eo '(http|https)://[^/"]+' > links.txt
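The grep chain from the wget pipeline can likewise be tested on a local snippet: it first isolates the <a> tags, then the href values, then just scheme://host. A sketch using invented sample markup:

```shell
# Extract <a ...> tags, then the href attribute values,
# then reduce each URL to its scheme and host only.
printf '%s' '<p><a href="https://example.com/page/1">one</a> <a href="http://other.test/x">two</a></p>' \
  | grep -Eoi '<a [^>]+>' \
  | grep -Eo 'href="[^"]+"' \
  | grep -Eo '(http|https)://[^/"]+' > links.txt
cat links.txt
```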
