LOOK FOR ANYTHING IN QUOTES:
#############################

This goes well with my new article: amazing way to extract links from raw html

cat text.html | grep -o '['"'"'"][^"'"'"']*['"'"'"]'
The grep expression is really simple.

Look at this and open it up. Remember how quotes are evaluated in bash: when a quote is met, the shell looks for the next quote of the same type to close it. Single quotes can only open and close each other, and the same goes for double quotes. Quoted pieces that sit right next to each other are then joined into one word.

Space them apart, then look between each set of quotes:
'['   "'"   '"][^"'   "'"   ']*['   "'"   '"]'
 [     '     "][^"     '     ]*[     '     "]

Bring that last line together, removing all of the spaces. So the final expression is:
['"][^"']*['"]

That's what grep will operate on.
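A quick way to check this reasoning is to let the shell join the pieces and print the result. This is only a minimal sketch: the variable names pattern and alt_pattern are mine, and the double-quoted spelling is just another way to write the same string, not something from the original command.

```shell
# Store the quoted expression exactly as it appears on the grep command line.
pattern='['"'"'"][^"'"'"']*['"'"'"]'

# Print what the shell hands to grep after the quotes are removed.
printf '%s\n' "$pattern"
# prints: ['"][^"']*['"]

# Alternative spelling (not from the article): inside double quotes
# only the double quote itself needs a backslash.
alt_pattern="['\"][^\"']*['\"]"
[ "$pattern" = "$alt_pattern" ] && echo "same pattern"
```

If the single-quote juggling is hard to read, the double-quoted form may be easier to maintain, at the cost of the backslashes.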

In grep that means:
['"] Look for ' or "
[^"']* Then look for a lot of characters that are not " or ' (quotes don't have quotes inside them, unless they are escaped, which this doesn't take into account)
['"] Then look for the ending quote (usually this is the real ending quote, but the pattern accepts either quote type, so it does not check that it is the same kind that opened)
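One consequence of this breakdown is that a double-quoted string containing an apostrophe gets cut short, because the closing bracket accepts either quote type. Here is a hedged sketch of the problem and of one possible stricter pattern. The sample line and the alternation pattern are my own, not from the original commands, and -E is grep's extended-regex option.

```shell
# Sample input (made up): a double-quoted value that contains an apostrophe.
line="<a title=\"it's here\" href=\"/x\">"

# The article's pattern stops at the first quote of either kind.
loose=$(printf '%s\n' "$line" | grep -o "['\"][^\"']*['\"]")
printf '%s\n' "$loose"
# prints:
# "it'
# " href="

# Stricter variant: the closing quote must be the same kind as the opening one.
strict=$(printf '%s\n' "$line" | grep -oE "\"[^\"]*\"|'[^']*'")
printf '%s\n' "$strict"
# prints:
# "it's here"
# "/x"
```

The stricter variant still ignores backslash-escaped quotes, so it is a small improvement rather than a full fix.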

Note: the -o option makes grep output only the part it matched (not the rest of the line).
Instead of -o, try --color: that way the whole line is kept and grep just highlights what it finds (in red by default). Also, -o is the same as --only-matching.
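To see the difference -o makes, here is a small sketch on a made-up line. The sample line and the variable names are assumptions for illustration only.

```shell
# Sample line (made up for this demo).
line="<a href=\"/about\" title='About us'>About</a>"

# With -o, each match is printed on its own line.
only=$(printf '%s\n' "$line" | grep -o "['\"][^\"']*['\"]")
printf '%s\n' "$only"
# prints:
# "/about"
# 'About us'

# Without -o, grep prints the whole line that contains a match.
whole=$(printf '%s\n' "$line" | grep "['\"][^\"']*['\"]")
printf '%s\n' "$whole"
# prints the full <a ...>About</a> line unchanged
```

Swapping -o for --color in an interactive terminal should show that same whole line with the two quoted parts highlighted. When the output goes into a pipe or a variable, GNU grep normally leaves the color codes out.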

Extract links out of html
##########################

wget -O - http://stackoverflow.com | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'

cat text.html | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'
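To try the pipeline without downloading anything, it can be fed from a small made-up page. This is a sketch under assumptions: the HTML and the example.com URLs are placeholders of mine, and the grep and sed stages are the same ones used above.

```shell
# Made-up sample page; the URLs are placeholders.
html=$(cat <<'EOF'
<p>Intro</p>
<a href="http://example.com/a">A</a> and <a href='/b'>B</a>
<a class="x" href="/c">C</a>
EOF
)

# Same grep and sed stages as the article's pipeline, fed from a variable.
links=$(printf '%s\n' "$html" \
  | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' \
  | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//')
printf '%s\n' "$links"
# prints:
# http://example.com/a
# /b
```

The third link is not printed: the pattern expects href to come immediately after "<a ", so anchors with another attribute first are skipped.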

Extract links if you know a part of it
#######################################

This only works if your link/url is surrounded by single or double quotes

cat text.html | grep -o '['"'"'"][^"'"'"']*['"'"'"]' | grep "www.infotinks.com"

wget -O - http://www.infotinks.com | grep -o '['"'"'"][^"'"'"']*['"'"'"]' | grep "www.infotinks.com"
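One detail worth knowing about the second grep: in a regular expression each "." matches any single character, so the filter is slightly looser than it looks. A hedged sketch, using made-up input lines, of how grep -F (fixed-string matching) tightens it:

```shell
# Made-up output of the "anything in quotes" stage.
quoted=$(printf '%s\n' '"http://www.infotinks.com/a"' '"http://wwwXinfotinksYcom/b"' '"/local"')

# Plain grep: each "." matches any single character.
loose=$(printf '%s\n' "$quoted" | grep "www.infotinks.com")

# grep -F: the pattern is a fixed string, so dots must be real dots.
strict=$(printf '%s\n' "$quoted" | grep -F "www.infotinks.com")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

On real pages the difference rarely matters, but -F costs nothing when the known part of the link is a literal string.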

Another method
###############

Only works on links that begin with http or https. This mess sometimes grabs extra characters (for example a trailing single quote, which the character class does not exclude).

cat text.html | grep --only-matching --perl-regexp "http(s?):\/\/[^ \"\(\)\<\>]*"
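Here is a rough sketch of where the extra stuff comes from and one way to trim it. Everything in it is an assumption of mine: the sample line, the translation of the character class into an extended regex (-E, so it does not depend on grep having Perl regex support), and the tighter variant that also excludes the single quote.

```shell
# Made-up sample line with one single-quoted URL and one in parentheses.
line="see <a href='http://example.com/a'>x</a> or (https://example.org/b)."

# Same character class as the article's pattern, written as an extended regex.
loose=$(printf '%s\n' "$line" | grep -oE "https?://[^ \"()<>]*")
printf '%s\n' "$loose"
# prints:
# http://example.com/a'
# https://example.org/b

# Variant: also stop at a single quote.
tight=$(printf '%s\n' "$line" | grep -oE "https?://[^ \"()<>']*")
printf '%s\n' "$tight"
# prints:
# http://example.com/a
# https://example.org/b
```

Excluding more characters trims more junk, but it would also cut off any URL that legitimately contains one of them.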

 

 
