How to Extract Links from a Web Page in Linux

Posted on January 16, 2019 at 3:39 pm

In Linux, you can easily extract the links from a web page with a few different command-line tools:

Using Lynx

Extract the links and save them to a file called links.txt:

lynx -dump http://www.google.com | awk '/http/{print $2}' > links.txt

If you need to number the lines, add the nl command to the pipeline:

lynx -dump http://www.google.com | awk '/http/{print $2}' | nl > links.txt
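
If you only want the URLs that lynx collects in its References list, rather than every line of page text that happens to contain "http", the -listonly option is a handy variant; a rough sketch, worth checking against your lynx build:

lynx -listonly -dump http://www.google.com | awk '/http/{print $2}' > links.txt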

Using Pup (Golang)

https://github.com/EricChiang/pup

First, install Go and pup:

apt-get install golang
go get github.com/ericchiang/pup
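
Note that on newer Go toolchains (roughly Go 1.17 and later), go get no longer installs binaries; assuming pup's module path is github.com/ericchiang/pup, the equivalent install command is:

go install github.com/ericchiang/pup@latest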

Then use this command to extract all links from a file:

pup 'a[href] attr{href}' < yourfile.html

To match only links with a specific class and save the results to a file:

pup 'a.classname[href] attr{href}' < yourfile.html > links.txt
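
Since pup reads HTML from standard input, you can also pipe a live page straight into it instead of using a saved file; a minimal sketch (the URL is just an example):

curl -s https://www.cnn.com | pup 'a[href] attr{href}' > links.txt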

Using cURL, tr, grep

curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq > links.txt

Here is some example output:

//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
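
Some of these results are protocol-relative URLs (they start with // and inherit the page's scheme). If you want every line to be a full URL, one option is to normalize them with sed before sorting; a rough sketch:

curl -s "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e '^//' | sed 's|^//|https://|' | sort | uniq > links.txt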

Using Wget, grep

wget -qO- http://google.com/ | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^"]+"' | grep -Eo '(http|https)://[^/"]+' > links.txt
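
The last grep in that pipeline keeps only the scheme and host, so the output is a list of domains rather than full links. If you want the complete href values instead (including relative ones), a variant like this should work with GNU grep and cut:

wget -qO- http://google.com/ | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^"]+"' | cut -d'"' -f2 > links.txt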
