While working on a project, I wanted to build a simple web crawler that finds subdomains of a given parent domain. It turns out this is pretty simple: query DuckDuckGo and parse the returned HTML with BeautifulSoup.
Using Google can be a pain because much of its results page is rendered by JavaScript, which makes it difficult to parse. DuckDuckGo makes it much easier by offering an HTML-only version at html.duckduckgo.com that serves mostly plain HTML, which is easy to parse.
Warning: If you run the crawler too many times, DuckDuckGo will block your IP address for 15 to 30 minutes. The solution is to sleep between calls or to rotate through a sufficient number of proxy servers.
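As a rough sketch of the "sleep between calls" idea, the helper below pauses for a random interval before every search except the first. The name throttled_searches, its parameters, and the 5 to 10 second default delay are my own assumptions for illustration; I don't know what rate DuckDuckGo actually tolerates, so treat the numbers as a starting point.

```python
import random
import time


def throttled_searches(queries, search_fn, min_delay=5.0, max_delay=10.0,
                       sleep_fn=time.sleep):
    """Run search_fn on each query, pausing a random interval between calls.

    The delay bounds are illustrative guesses, not documented limits.
    sleep_fn is injectable so the pauses can be skipped or recorded in tests.
    """
    results = {}
    for i, query in enumerate(queries):
        if i > 0:
            # Pause before every call except the first one.
            sleep_fn(random.uniform(min_delay, max_delay))
        results[query] = search_fn(query)
    return results
```

You could call it as throttled_searches(["site:google.com", "site:example.com"], duckduckgo_search_results), using the search function from the code below. A random delay is used rather than a fixed one so the requests are less regular, but a fixed sleep would work the same way.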
The Code
import requests
import bs4


def duckduckgo_search_results(q):
    # Make the bot look like it's using a browser.
    headers = {
        'User-Agent': "Mozilla/5.0"
    }
    # Search DuckDuckGo via its HTML-only endpoint
    url = "https://html.duckduckgo.com/html/"
    # Grab the HTTP Response object; passing the query via params URL-encodes it
    resp = requests.get(url, params={"q": q}, headers=headers, timeout=10)
    pages = []
    # If the status code is 200, then continue
    if resp.status_code == 200:
        # Parse the HTML
        soup = bs4.BeautifulSoup(resp.text, "html.parser")
        # Grab links from result__url class
        result_url_tags = soup.find_all(class_="result__url")
        # Grab the text from each tag, trimming surrounding whitespace
        for tag in result_url_tags:
            pages.append(tag.get_text(strip=True))
    return pages


def main():
    pages = duckduckgo_search_results("site:google.com")
    print("Page Count: " + str(len(pages)))


if __name__ == "__main__":
    main()
You’ll notice that the only header set is the User-Agent, which makes the bot appear as a browser.
Once the requests module returns the response, BeautifulSoup's html.parser is used to parse the HTML returned by DuckDuckGo.
Afterwards, it is easy to use selectors to extract URLs from the search results, since each URL is wrapped in an element with the CSS class result__url.
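To get from the extracted result text to actual subdomains, something like the following could work. It is a minimal sketch using only the standard library; the function name extract_subdomains is hypothetical, and it assumes each result string looks like a URL with or without a scheme (for example "mail.google.com/mail").

```python
from urllib.parse import urlparse


def extract_subdomains(pages, parent_domain):
    """Return sorted, de-duplicated hostnames from pages under parent_domain.

    The parent domain itself is included if it appears in the results.
    """
    parent = parent_domain.lower().strip(".")
    found = set()
    for page in pages:
        text = page.strip()
        if not text:
            continue
        # Result text may lack a scheme; add one so urlparse sets hostname.
        if "://" not in text:
            text = "http://" + text
        try:
            host = urlparse(text).hostname
        except ValueError:
            # Skip strings that urlparse rejects as malformed.
            continue
        # Require an exact match or a dot boundary, so "notgoogle.com"
        # is not mistaken for a subdomain of "google.com".
        if host and (host == parent or host.endswith("." + parent)):
            found.add(host)
    return sorted(found)
```

With the crawler above, this would be called as extract_subdomains(duckduckgo_search_results("site:google.com"), "google.com"). Collecting into a set first means a subdomain that shows up in several results is only reported once.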
Final Words
In this blog post, you learned how to crawl for web pages using DuckDuckGo and a bot written in Python.
