    Building a DuckDuckGo Web Crawling Bot in Python

    I have been working on a project that needed a simple web crawler to collect the subdomains of a given parent domain. This turns out to be fairly simple: query DuckDuckGo, then parse the returned HTML with BeautifulSoup.

    Using Google can be a pain because most of its results page is rendered by JavaScript, which makes the markup difficult to parse. DuckDuckGo makes this much easier by offering an HTML-only version at html.duckduckgo.com that returns mostly static HTML.

    Warning: If you run the crawler too many times in a short period, DuckDuckGo will block your IP address for 15 to 30 minutes. The solution is to sleep between calls or to spread requests across a sufficient number of proxy servers.
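
    As a rough illustration of the "sleep between calls" approach, here is a minimal sketch. The helper name throttled and the 5-second delay in the usage note are assumptions made for this example; DuckDuckGo does not publish a safe request rate, so treat the delay as something to tune.

```python
import time


def throttled(items, delay_seconds, sleep=time.sleep):
    """Yield each item, pausing delay_seconds between consecutive items.

    Hypothetical helper, not part of the original crawler. The sleep
    argument is injectable so the pause can be replaced in tests.
    """
    for index, item in enumerate(items):
        # No pause before the first item, only between items.
        if index > 0:
            sleep(delay_seconds)
        yield item
```

    To use it, loop over the queries with something like `for query in throttled(queries, 5):` and run one search per iteration, so consecutive searches are at least five seconds apart.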

    The Code

    import requests
    import bs4
    
    def duckduckgo_search_results(q):
        # Make the bot look like it's using a browser.
        headers = {
            'User-Agent': "Mozilla/5.0"
        }
    
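        # Search DuckDuckGo through its HTML-only endpoint
        url = "https://html.duckduckgo.com/html/"
    
        # Grab the HTTP Response object. Passing the query via params lets
        # requests URL-encode it; the timeout stops a stalled request from hanging.
        resp = requests.get(url, params={"q": str(q)}, headers=headers, timeout=10)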
    
        pages = []
    
        # If the status code is 200, then continue
        if resp.status_code == 200:
            # Parse the html
            soup = bs4.BeautifulSoup(resp.text, "html.parser")
    
            # Grab links from result__url class
            result_url_tags = soup.find_all(class_="result__url")
    
            # Grab the text from each tag, trimming any surrounding whitespace.
            for tag in result_url_tags:
                pages.append(tag.get_text(strip=True))
        
        return pages
    
    
    def main():
        pages = duckduckgo_search_results("site:google.com")
        print("Page Count: " + str(len(pages)))
    
    if __name__ == "__main__":
        main()

    You’ll notice that the only header set is the User-Agent, which makes the bot appear as a browser.

    Once the requests module has sent the query and received the response, BeautifulSoup parses the HTML returned by DuckDuckGo using Python's built-in html.parser.
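
    One detail worth spelling out is how the query ends up in the URL. A query that contains spaces, `&`, or `:` should be percent-encoded before it is sent. The sketch below does this with the standard library; build_search_url is a hypothetical helper introduced only for illustration.

```python
from urllib.parse import urlencode

# HTML-only DuckDuckGo endpoint used by the crawler.
BASE_URL = "https://html.duckduckgo.com/html/"


def build_search_url(query):
    """Return the search URL with the query safely percent-encoded.

    Hypothetical helper for illustration only.
    """
    return BASE_URL + "?" + urlencode({"q": str(query)})
```

    For example, `build_search_url("site:google.com")` encodes the colon as `%3A`, so the query cannot be misread as part of the URL structure.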

    Afterwards, extracting the URLs from the search results is straightforward with find_all, since each one sits in an element with the CSS class result__url.
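
    The original goal was to collect subdomains, so the extracted strings still need to be reduced to hostnames. Here is a minimal sketch, assuming each result string is either a full URL or a bare "host/path" value without a scheme. The function name extract_subdomains is hypothetical and not part of the code above.

```python
from urllib.parse import urlsplit


def extract_subdomains(pages, parent_domain):
    """Return the sorted, de-duplicated subdomains of parent_domain.

    Hypothetical helper. Assumes each entry in pages is a full URL or a
    bare "host/path" string; anything else is silently skipped.
    """
    parent = parent_domain.lower().strip(".")
    found = set()
    for page in pages:
        text = page.strip()
        # urlsplit only recognizes a host when a "//" prefix is present.
        if "://" not in text:
            text = "//" + text
        host = (urlsplit(text).hostname or "").lower()
        # Keep proper subdomains only, not the parent or look-alike domains.
        if host.endswith("." + parent):
            found.add(host)
    return sorted(found)
```

    Matching on "." plus the parent domain, rather than the parent domain alone, keeps a host such as notgoogle.com from being counted as a subdomain of google.com.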

    Final Words

    In this blog post, you learned how to crawl for web pages using DuckDuckGo and a bot written in Python.