How to scrape SERP snippets with Python
Obtaining visibility in search snippets is an excellent way to boost CTR and increase organic site traffic. As a quick refresher, below is a visual example of a SERP snippet for the query “paid search vs. seo”:
Snippets take up serious SERP real estate, often driving more traffic than a #1 ranking. There are several great hacks on how to optimize for Google snippets, but not many hacks on how to find snippet opportunities. The process is very manual. It usually requires the following:
- Search for keywords individually on Google
- Record the SERP snippets per keyword
- Identify which existing content/pages can break into those snippets
- Edit the content
Step 3 is the ultimate takeaway, but about 90% of the work goes into steps 1 and 2. What if you could automate finding the snippets and preserve your brainpower for winning them? That’s where Python comes in. As a novice Python coder (if I can even call myself that), I’ve quickly realized that Python can significantly reduce the time spent on SERP research projects. Snippet research is one of those instances.
Rather than individually searching/recording Google snippets for hundreds of keywords, Python can do the grunt work so you can be more efficient with the time spent winning that valuable organic real estate.
Preamble: Change your VPN & don’t perform this operation on a network that multiple parties depend on
While search engine scraping is legal, Google can flag and deny any IP it suspects of bot-like behavior. Therefore, changing proxies is a prerequisite to successfully scraping. If you constantly use the same VPN and IP address for this practice, Google can store your information in its database of repeat offenders. While these bans are usually temporary, they still increase your likelihood of being denied again. This can be especially problematic if your work address is stored on a denylist and none of your coworkers can access Google.
Part 1: Get Python to read the document
First, we must list all the keywords we want to search in a text file. We choose text files because they’re minimal and easy for Python to handle. Save the file somewhere easy to access, as Python will need to reach it through a file path on your computer.
*WARNING: IF YOU ARE SAVING AS A RICH TEXT DOCUMENT (.RTF) ON A MAC, MAKE SURE TO CONVERT THE FILE TO .TXT. WE WANT THE SIMPLEST FILE POSSIBLE SO PYTHON DOESN’T WASTE ITS TIME READING HEADER ELEMENTS FROM A COMPLICATED WORD PROCESSOR.
Line 1: with open("/Users/Desktop/Text_Doc_Python1.txt", "r") as f:
We instruct Python to open the list of queries from the file’s absolute path on your computer. In this example, the file was on my desktop and was housed under the path “/Users/Desktop/”. The file with my keywords was titled “Text_Doc_Python1.txt”. Hence, our open instruction is to open “/Users/Desktop/Text_Doc_Python1.txt”. In that same line, we’re giving Python read permissions ("r"). So now Python can both open and read the file. Lastly, I will define this whole operation of opening and reading the file as “f”. This way, I can refer back to it via a single letter rather than typing out that long file path whenever I want to use it.
Line 2: queries = [line.strip() for line in f]
Our keyword list might not be perfect, and there might be some stray spaces following our term. To account for this, we will use line.strip() to remove any stray spaces from before or after the KW; this ensures that the term you think is getting Googled is getting Googled. We will define these cleaned-up line items from our handy ‘f’ document as “queries.” This will represent each query we’re going to get a snippet from.
Line 3: print(queries)
For good measure, we’ll also print the queries in our Console to preview what exactly Python will be running through Google search.
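Put together, Part 1 is just a few lines. Here is a self-contained sketch: the sample keywords are placeholders I'm making up so the snippet runs anywhere, and in practice you would point the path at your own .txt file.

```python
# A self-contained sketch of Part 1. The sample keywords below are
# placeholders; in practice, open your own file instead
# (e.g. "/Users/Desktop/Text_Doc_Python1.txt").
with open("queries_sample.txt", "w") as f:
    f.write("paid search vs. seo\n  how do featured snippets work  \n")

# Open the keyword list with read permissions ("r"), then strip stray
# whitespace from each line so the exact term is what gets Googled.
with open("queries_sample.txt", "r") as f:
    queries = [line.strip() for line in f]

print(queries)
```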
Part 2: Import our packages
In Python, we work with modules (often distributed as packages). A module is a piece of software that serves a very specific function. In Python, you access modules through the “import” statement. For this specific project, we’ll be importing these packages. I’ll explain why:
The selenium webdriver module is what’s going to allow Python to perform searches in Google.
Requests will supplement webdriver by allowing Python to request a specific search URL from the server.
BeautifulSoup will let Python analyze that SERP and scrape elements (i.e. the snippet).
Random generates a random number within a certain defined range. We use random so that each request has a different server request time. If we run hundreds of requests that have the same exact delay time in between each search, Google will assume you are a bot and likely block your IP.
Time is required to define the sleep period in between searches before Python performs another one. Time works in tandem with the random module in this project.
The csv module simply allows Python to interact with and write csv files. When Python is done crawling our list items in Google, we’ll want it to package up the results in a nice CSV document that is ready for analysis.
Lastly, even though delay_seconds isn’t a package, we’re going to take care of defining this variable now because we’ll need it later. What we’re doing is using our newly imported random package to give us a random integer between 100 and 500. We’re then dividing that number by 100. So what does that do? It gives us any decimal number between 1.00 and 5.00, which will be used as the amount of seconds our program will wait in between crawls. We’ll explain why that matters later.
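The standard-library part of the import block, plus the delay_seconds variable, looks like this. (The third-party imports for selenium, requests, and BeautifulSoup are omitted here so the sketch runs without any installs.)

```python
import csv      # writing the results file
import random   # random integers for the delay
import time     # sleeping between searches

# randint(100, 500) returns an integer from 100 to 500 inclusive;
# dividing by 100 gives a delay between 1.00 and 5.00 seconds.
delay_seconds = random.randint(100, 500) / 100
print(delay_seconds)
```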
Part 3: Set up Chrome Driver
For the selenium webdriver module to work, Python needs an application to run the searches through. To get this, you will need to install ChromeDriver from https://chromedriver.chromium.org.
Once you have ChromeDriver installed, you will need to find where it’s located on your computer.
Much like we did with our open("/Users/Desktop/Text_Doc_Python1.txt", "r") command, where we needed to tell Python where the file was, what we were accessing, and what to do with it, we’re doing the same with ChromeDriver. We’re telling Python where the driver is and where these searches will be performed.
In this example, the file was found in /Users/Downloads/ and was called “chromedriver 3”. Ergo, we are defining our chromedriver variable as “/Users/Downloads/chromedriver 3”.
While we’ve defined which application we’ll use to search things, we haven’t yet set a command to search things. driver = webdriver.Chrome(chromedriver) commands Python to automate a search process via webdriver using the chromedriver browser that we just assigned above. We’ve taken this process of automating a search and defined it as driver.
So to recap: webdriver is our automation tool, and chromedriver is the driver that controls the Google Chrome browser where our searches will be automated.
Part 4: Create a new file where our scraped results will go
Before we finally start automating searches, we want to make sure that this data will be packaged up in a file for us once it’s done. The data won’t do much good sitting in a Python console: we need it in a CSV file that we can manipulate and analyze.
Line 1: with open('innovators.csv', 'w') as file:
To do this, we will pull the same open command we used to access our list of queries earlier. But there’s a core difference in how we’re using it. For the query list, we just wanted Python to read the file (hence the "r" in with open("/Users/Desktop/Text_Doc_Python1.txt", "r")). Here, we want Python to write a file, so we’re going with 'w' instead. This whole process of writing to the file I’ve inexplicably named 'innovators.csv' will be defined as file.
Line 2: fields = ['Query', 'Snippet']
We will give this file header values of ‘Query’ and ‘Snippet.’ We want to easily show a third party that “Column A is the search keyword, Column B is the snippet result.” These two headers are being packaged up into an easy variable named fields.
Line 3: writer = csv.writer(file)
I’m now creating a variable called “writer”, which we will use to write onto the file we defined before.
Line 4: writer.writerow(fields)
Now that we’re accessing the file, I can write my fields onto my csv document. When this script runs and writes a CSV file, my columns will have a header element now. So far, that’s all the document has, though.
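Part 4 as one runnable piece looks like the sketch below. One small addition: newline='' is the csv module's recommended setting when opening a file for csv.writer, so rows aren't double-spaced on some platforms.

```python
import csv

# Create the output file in write mode ('w') and add the header row.
# 'innovators.csv' is the (arbitrary) filename used in this article.
fields = ['Query', 'Snippet']
with open('innovators.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(fields)
```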
Part 5: Perform the searches
Line 1: for item in queries:
In our first line, we’re establishing an important Python construct called a “for loop,” which essentially repeats an operation for us. Here, we’re simply instructing our program that we’re about to perform an operation for each query (or item) from our queries variable (the full list of queries we defined earlier).
Line 2: updated_query = item.replace(" ", "+")
For each query in our doc, we’re now mutating each one into a Google search URL. We know that when you search for something in Google, the URL we get back follows a formula:
Query: “test query”
Google URL: https://www.google.com/search?q=test+query
Easy enough! First, we must replace each space with a “+” sign. As you can see in the Google URL above, “test query” is transformed into “test+query” in the Google search URL. We then apply that formula to each query (or item) and redefine it as updated_query. To recap:
item = "test query"
updated_query = "test+query"
Line 3: google_url = "https://www.google.com/search?q=" + updated_query
Here, we’re now making another variable called google_url, simply the Google URL search prefix of https://www.google.com/search?q= followed by our URL-ready updated_query.
In these few steps, we’ve now changed “test query” into https://www.google.com/search?q=test+query for each one of our keywords.
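These two lines can be verified on their own with the “test query” example from above:

```python
# Turn a plain query into a Google search URL, as described above.
item = "test query"
updated_query = item.replace(" ", "+")
google_url = "https://www.google.com/search?q=" + updated_query
print(google_url)  # https://www.google.com/search?q=test+query
```

For queries containing punctuation beyond plain spaces, Python’s built-in urllib.parse.quote_plus handles the URL encoding more robustly than a single replace.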
Line 4: driver.get(google_url)
In Step 3, we defined the action of performing a search on our Google driver as driver. So now that we have both our Google search operation set up and the specific URL we need to be searched, we’re just instructing driver to perform its operation with our google_url.
Line 5: time.sleep(delay_seconds)
You may recall that in Part 2, we imported our packages and also made the variable delay_seconds generate a decimal number between 1 and 5: delay_seconds = random.randint(100, 500)/100. Each time this script runs, a different number is generated and assigned as the time.sleep value. time.sleep is the number of seconds the program will wait before performing another search. So after each search, the program will wait somewhere between 1.00 and 5.00 seconds before performing the next one.
Why do we do that? If our program waits the same exact amount of time between dozens of searches, Google will figure out pretty quickly that a bot is performing the search, which increases the odds of getting blocked. We randomize the time.sleep value so our request pattern looks less bot-like and is harder to flag.
Part 6: Pull the snippets
Line 1: soup = BeautifulSoup(driver.page_source, 'lxml') The BeautifulSoup package we imported earlier allows us to parse the HTML of a live page. Meanwhile, driver has a built-in page_source attribute that returns the HTML of the currently loaded page ('lxml' is said parser). We’re defining this whole operation as soup.
Line 2: for s in soup.find_all(id="res"): We’re running another for loop for every scraped Google result. For every scrape we perform (now defined as soup), we want to find all instances where the page code has an id value of "res" (hence id="res").
Why? Because the actual code within Google’s SERP has defined “res” as a DIV id:
You could choose from a number of other ids found on Google’s SERP, but we went with “res” here.
Lines 3 & 4: s = s.text.replace('Search ResultsFeatured snippet from the web', '') and ns = s.replace('Search ResultsWeb results', 'No snippet : ')
Here we’re just cleaning up the result from our scrape (s), or labeling it if our result returned no snippet. ns is the variable that will live on, because it searches for a phrase that ultimately tells us whether a snippet exists ('Search ResultsFeatured snippet from the web') or doesn’t ('Search ResultsWeb results'). If a snippet exists, we’ll get the scraped result back. If it doesn’t, the line for that query will read 'No snippet : '.
Lines 5 & 6: data = item, ns and print(data)
Lastly, we’re just making a variable that organizes the data we want to get back. As you may recall, item is from the beginning of our Part 5 for loop and is the original query we used for our scrape. ns is our scrape result (which will either yield the scrape result or “No Snippet”).
The print(data) command will then display for us what those results will be.
Part 7: Get the scrape results written to our file, ready to analyze
The scraping is done! Now we need to get it into a document we can analyze. You may recall the following code from Part 4.
So since writer has already been defined as the action of writing onto our original ‘innovators.csv’ doc, and since we already have ‘Query’ and ‘Snippet’ written onto the doc as column headers (via writer.writerow(fields)), we then invoke the writer.writerow command again to write our data result (item, which is essentially the query, and ns, which is the snippet result) directly below the appropriate headers in the correct columns.
Once all the loops have run and are written into our document, we then use file.close() to close the file.
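The writing logic end to end can be sketched with stand-in results; in the real script, each (item, ns) pair would come out of the scraping loop in Parts 5 and 6 rather than a hardcoded list.

```python
import csv

# Stand-in results. In the real script, each (item, ns) pair comes
# from the scraping loop; these values are made up for illustration.
results = [
    ('How do you get health insurance in Vermont', 'No snippet : '),
    ('test query', 'Example snippet text'),
]

file = open('innovators.csv', 'w', newline='')
writer = csv.writer(file)
writer.writerow(['Query', 'Snippet'])   # header row from Part 4
for item, ns in results:
    data = item, ns
    writer.writerow(data)               # one row per query
file.close()                            # close once all loops have run
```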
Our first query “How do you get health insurance in Vermont” returned no snippet at the time of the search. Meanwhile, “How do you get health insurance in West Virginia” did, and we can see the result along with the URL at the very end.
Now you know how to scrape featured snippets from Google! You can likely make small tweaks to scrape for other features such as People Also Ask fields, but this is a good starting point for your snippet scraping needs.