Amazing way to extract links from raw HTML
##########################################

This goes well with my other article: Extract Quotes

The best way I found to extract links from HTML comes from this Stack Overflow question: http://stackoverflow.com/questions/1881237/easiest-way-to-extract-the-urls-from-an-html-page-using-sed-or-awk-only

cat index.html | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'
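To see what the pipeline actually returns, here is a minimal sketch that wraps the same grep and sed expressions in a function and runs it on a made-up snippet. The function name extract_links and the sample HTML are assumptions for illustration only; they are not part of the original Stack Overflow answer.

```shell
# extract_links is a hypothetical wrapper name; the grep and sed
# expressions are the same ones used in the one-liner above.
# With no arguments it reads standard input; otherwise it reads the named file.
extract_links() {
  grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' "$@" |
    sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'
}

# Made-up sample input (not from the original post).
extract_links <<'EOF'
<p><a href="http://example.com/one">One</a> and <a href='/two'>Two</a></p>
<a class="nav" href="/three">Three</a>
EOF
# Prints:
#   http://example.com/one
#   /two
# The third link is skipped because "class" sits between "<a" and "href".
```

With the function defined, `extract_links index.html` does the same job as the one-liner without the extra cat.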

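NOTE: index.html can be a complete page or just a fragment of one. The command above extracts links pretty well, but as the Stack Overflow thread says, it is still bound by the limits of regular expressions. For example, it only matches anchors where href comes immediately after "<a ", so a tag like <a class="x" href="..."> is skipped, and so are uppercase tags and unquoted href values.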
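Some of those limitations can be loosened with a slightly wider pattern. The following is only a sketch, not something taken from the Stack Overflow thread: extract_links_relaxed is a hypothetical name, and it assumes a grep that supports -o, -i and -E (GNU and BSD grep both do). It tolerates other attributes before href and uppercase tags, but it is still a regular expression, so it will still miss unquoted values and tags split across lines.

```shell
# extract_links_relaxed is a hypothetical name, not from the original post.
extract_links_relaxed() {
  # -i  : also match <A HREF=...>
  # -E  : extended regex, needed for the optional (...)? group
  # The optional group allows other attributes between "<a" and "href".
  grep -oiE '<a([[:space:]][^>]*)?[[:space:]]href=["'"'"'][^"'"'"']*["'"'"']' "$@" |
    # Drop everything up to and including href= and its opening quote,
    # then drop the closing quote.
    sed -e 's/^.*[[:space:]][Hh][Rr][Ee][Ff]=["'"'"']//' -e 's/["'"'"']$//'
}

# Made-up sample input (not from the original post).
extract_links_relaxed <<'EOF'
<A HREF="/upper">U</A>
<a class="nav" href='/three'>Three</a>
<a href="/a" data-href="/b">x</a>
EOF
# Prints:
#   /upper
#   /three
#   /a
```

Requiring whitespace right before href keeps attributes such as data-href from being mistaken for the real link.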

NOTE: this also works on Windows using grep.exe, sed.exe and cat.exe (which come with a regular Cygwin install).
