Web scraping with C# and Selenium – part 2

29 Mar '21 by Dinko Gregorić

In part 1 of this blog post series we covered the most common approach to web scraping and its issues. We also walked through a small example of how to start web scraping with C#, Selenium and QueryStorm in Excel. Now we’ll expand on the example from part 1 and create a more useful web scraper.

Navigating to and scraping paginated items

It’s time to kick the web scraping up a notch. This time, let’s scrape the names and prices of the top items on the home page, then navigate to the laptops category and scrape all of the laptops as well.

Preparing the table

The current table rows are no longer relevant, so let’s delete them. Instead of removing the entries by hand, we can call ResultsTable.Clear() to delete them all at once.

[Image: the QueryStorm IDE with the code to clear the table]

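In code, that’s a single call at the top of the script:

// Delete all existing rows from the results table
ResultsTable.Clear();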

In addition, we should edit the ResultsTable itself: rename the Results column to Product Name and add a new column named Price.

[Image: the updated table, ready for the web scraping data]


Getting the price

To get the price along with the title of the top items, our script needs only minor modifications. First, we find all of the items by their CSS selector (div.thumbnail). Then, inside each parent item element, we find the name and the price by their respective selectors (name: h4 > a, price: div.caption > h4.pull-right.price).

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");
    // Find all top items on the home page
    var topItems = driver.FindElements(By.CssSelector("div.thumbnail"));
    topItems.ForEach(i => 
        ResultsTable.AddRow(r => 
            {
                r.Product_Name = i.FindElement(By.CssSelector("h4 > a")).GetAttribute("title");
                r.Price = i.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
            })
        );
}


Preventing CSS selector issues

Just a heads up: without any changes to the default driver initialization, the browser opens in a small window. There’s a chance the page will render with a mobile/tablet layout, so CSS selectors copied from the DevTools of a maximized browser window may no longer match anything. To prevent this issue, we start the driver with options that tell the browser to start maximized.

// Maximize browser
ChromeOptions options = new ChromeOptions();
options.AddArgument("--start-maximized");

using (IWebDriver driver = new ChromeDriver(options))
{
    …
}


Page navigation

The next step is navigating – first to the Computers page and then to the Laptops page.

We can navigate by clicking the Computers menu item and waiting for the Computers page to load, then clicking the Laptops menu item and waiting for the Laptops page to load.

Note: We could simply navigate to the URL https://webscraper.io/test-sites/e-commerce/ajax/computers/laptops instead of clicking the side menu items (see the one-liner below), but clicking and waiting for a page to load is such a common problem in web scraping that it’s worth demonstrating.
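
For completeness, the direct-navigation alternative is a single line:

driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax/computers/laptops");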

Clicking the button

Clicking is easy – we find the element and call its Click method.

driver.FindElement(By.CssSelector("#side-menu > li.active > ul > li:nth-child(1) > a")).Click();


Waiting for a page to load


Waiting itself is not an issue: the WebDriverWait class lets us wait up to a specified amount of time until an arbitrary condition is met. Defining that condition, however, can prove to be a problem.

var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(...);


In our case, the condition is that the new page has loaded. To check this, we need to determine exactly when the old page has unloaded and the new page has loaded. The most robust way to achieve this is to wait for an element on the old page to go “stale” (no longer attached to the DOM) and for an element on the new page to be displayed.

As a sort of helping hand, we could install the DotNetSeleniumExtras.WaitHelpers NuGet package and use its ready-made conditions to check whether a new page has loaded. However, the project is no longer maintained, and the relevant code isn’t complicated, so we can write the conditions ourselves.
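
For reference, with that package the wait would look roughly like this (a sketch only, assuming the package’s SeleniumExtras.WaitHelpers namespace; oldPageElement stands for an element grabbed before the click):

// Hypothetical usage of the unmaintained DotNetSeleniumExtras.WaitHelpers package
wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.StalenessOf(oldPageElement));
wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementIsVisible(By.CssSelector("div.col-md-9 > h2")));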

The WebDriverWait’s Until method takes a parameter of type Func<IWebDriver, TResult>. Therefore, we’ll create a NewPageLoaded method that returns such a delegate, which we can then pass to Until. The code can look something like this…

private Func<IWebDriver, bool> NewPageLoaded(IWebElement oldPageElement, By newPageElementLocator)
{
    return (driver) =>
        {
            bool oldElementStale = …;
            bool newElementVisible = …;
            
            return oldElementStale && newElementVisible; 
        };
}


To complete the NewPageLoaded method, we need to replace the dots with concrete staleness and visibility checks. These checks will also return delegates, so we can use them both as regular methods and with the Until method. So, let’s define the methods that check for staleness and visibility.
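
Because each check returns a Func<IWebDriver, bool>, it can also be handed straight to Until. For example, later in the script we’ll wait for the current page’s items to go stale like this:

wait.Until(IsElementStale(laptopItems.First()));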

Element staleness

For our purposes, an element counts as stale if any of these conditions is met:

  • The element is disabled
  • The element is missing (null)
  • Accessing the element throws a StaleElementReferenceException

private Func<IWebDriver, bool> IsElementStale(IWebElement element)
{
    return (driver) =>
        {
            try 
            {            
                // A missing or disabled element counts as stale
                return element == null || !element.Enabled;
            }
            catch (StaleElementReferenceException)
            {
                // The element is no longer attached to the DOM
                return true;
            }
        };
}


Element visibility

Similarly, an element counts as visible if:

  • The driver can find the element
  • The element is displayed

private Func<IWebDriver, bool> IsElementVisible(By elementLocator)
{
    return (driver) =>
        {
            try 
            {            
                // FindElement throws NoSuchElementException if the element is missing;
                // WebDriverWait ignores that exception by default and keeps polling
                var newPageElement = driver.FindElement(elementLocator);
                return newPageElement.Displayed;
            }
            catch (StaleElementReferenceException)
            {
                return false;
            }
        };
}


Page loaded condition and navigation

Finally, the NewPageLoaded method looks like this:

private Func<IWebDriver, bool> NewPageLoaded(IWebElement oldPageElement, By newPageElementLocator)
{
    return (driver) =>
        {
            bool oldElementStale = IsElementStale(oldPageElement).Invoke(driver);
            bool newElementVisible = IsElementVisible(newPageElementLocator).Invoke(driver);
            
            return oldElementStale && newElementVisible; 
        };
}


Once we decide which element on each page we’ll use to detect that a new page has loaded, we’re ready to navigate to the Computers page and the Laptops page. I chose the following:

Page           | Element                          | CSS selector
---------------|----------------------------------|--------------------------
Home page      | h1                               | div.jumbotron > h1
Computers page | h2                               | div.col-md-9 > h2
Laptops page   | div class="btn-group pagination" | div.btn-group.pagination


Now we can finally perform navigation:

// Create WebDriverWait instance
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

// Navigate to Computers page and wait until it's loaded
var homePageElem = driver.FindElement(By.CssSelector("div.jumbotron > h1"));
driver.FindElement(By.CssSelector("#side-menu > li:nth-child(2)")).Click();
wait.Until(NewPageLoaded(homePageElem, By.CssSelector("div.col-md-9 > h2")));

// Navigate to Laptops page and wait until it's loaded
var computerPageElem = driver.FindElement(By.CssSelector("div.col-md-9 > h2"));
driver.FindElement(By.CssSelector("#side-menu > li.active > ul > li:nth-child(1) > a")).Click();
wait.Until(NewPageLoaded(computerPageElem, By.CssSelector("div.btn-group.pagination")));


Scraping paginated laptop items

Now that we’ve navigated to the Laptops page, we can scrape the laptop items. We need a few things to do that.

First of all, we need a reference to the “Next” button element – clicking it loads the items page by page (button CSS selector – button.btn.btn-default.next).

Second, we have to wait until the next page of items has loaded. Luckily, we’ve already written a method that checks element staleness, so we can infer that a new page of items has loaded once the laptop items from the current page go stale.

And lastly, we should check whether the “Next” button is enabled, so we know when we’ve reached the last page of items.

// Scrape all laptops (with pagination)
var nextPageBtn = driver.FindElement(By.CssSelector("button.btn.btn-default.next"));
bool isLastPage = false;
do
{
    // The "Next" button is disabled on the last page
    isLastPage = !nextPageBtn.Enabled;
    var laptopItems = driver.FindElements(By.CssSelector("div.thumbnail"));
    laptopItems.ForEach(i => 
        ResultsTable.AddRow(r => 
            {
                r.Product_Name = i.FindElement(By.CssSelector("h4 > a")).GetAttribute("title");
                r.Price = i.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
            })
    );
    if (!isLastPage)
    {
        nextPageBtn.Click();
        // The current items going stale means the next page of items has loaded
        wait.Until(IsElementStale(laptopItems.First()));
    }
} while (!isLastPage);


We are almost done with our scraper! Let’s run the script with F5 and wait a couple of seconds. As a result, we can see 120 scraped products in our table. However, we should do one more thing: refactor the code a bit.

Finishing steps

First of all, the code for saving home items and laptop items is the same. Therefore, we can extract a method for saving items.

We can also extract a method for page navigation. We just need to pass different CSS selectors when calling the method.

And lastly, to keep the main part of the script nice and readable, we can do two things: create a method just for scraping laptop items, and create a dedicated method that navigates to the Laptops page.

We’re done!

Finally, here’s the full code for the tutorial:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Clear old results
ResultsTable.Clear();

// Maximize browser
ChromeOptions options = new ChromeOptions();
options.AddArgument("--start-maximized");

using (IWebDriver driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");
    
    // Find top items and save them to the ResultsTable
    var topItems = driver.FindElements(By.CssSelector("div.thumbnail"));
    SaveItems(topItems);
    
    // Create WebDriverWait instance
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    
    // Navigate to Laptops page and wait until it's loaded
    NavigateToLaptops(driver, wait);
    
    // Scrape all laptops (with pagination)
    ScrapeLaptops(driver, wait);
}

private void ScrapeLaptops(IWebDriver driver, WebDriverWait wait)
{
    var nextPageBtn = driver.FindElement(By.CssSelector("button.btn.btn-default.next"));
    bool isLastPage = false;
    do
    {
        // The "Next" button is disabled on the last page
        isLastPage = !nextPageBtn.Enabled;
        var laptopItems = driver.FindElements(By.CssSelector("div.thumbnail"));
        SaveItems(laptopItems);
        if (!isLastPage)
        {
            nextPageBtn.Click();
            // The current items going stale means the next page of items has loaded
            wait.Until(IsElementStale(laptopItems.First()));
        }
    } while (!isLastPage);
}

private void NavigateToLaptops(IWebDriver driver, WebDriverWait wait)
{
    // Navigate to Computers page and wait until it's loaded
    Navigate(driver, wait, 
        By.CssSelector("div.jumbotron > h1"), 
        By.CssSelector("#side-menu > li:nth-child(2)"),
        By.CssSelector("div.col-md-9 > h2"));
    
    // Navigate to Laptops page and wait until it's loaded
    Navigate(driver, wait, 
        By.CssSelector("div.col-md-9 > h2"), 
        By.CssSelector("#side-menu > li.active > ul > li:nth-child(1) > a"),
        By.CssSelector("div.btn-group.pagination"));
}

private void Navigate(IWebDriver driver, WebDriverWait wait, By oldPageElementLocator, By elementToClick, By newPageElementLocator)
{
    var oldPageElem = driver.FindElement(oldPageElementLocator);
    driver.FindElement(elementToClick).Click();
    wait.Until(NewPageLoaded(oldPageElem, newPageElementLocator));
}

private void SaveItems(IReadOnlyCollection<IWebElement> items)
{
    items.ForEach(i => 
        ResultsTable.AddRow(r => 
            {
                r.Product_Name = i.FindElement(By.CssSelector("h4 > a")).GetAttribute("title");
                r.Price = i.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
            })
        );
}

private Func<IWebDriver, bool> NewPageLoaded(IWebElement oldPageElement, By newPageElementLocator)
{
    return (driver) =>
        {
            bool oldElementStale = IsElementStale(oldPageElement).Invoke(driver);
            bool newElementVisible = IsElementVisible(newPageElementLocator).Invoke(driver);
            
            return oldElementStale && newElementVisible; 
        };
}

private Func<IWebDriver, bool> IsElementStale(IWebElement element)
{
    return (driver) =>
        {
            try 
            {           
                return element == null || !element.Enabled;
            }
            catch (StaleElementReferenceException)
            {
                return true;
            }
        };
}

private Func<IWebDriver, bool> IsElementVisible(By elementLocator)
{
    return (driver) =>
        {
            try 
            {           
                var newPageElement = driver.FindElement(elementLocator);
                return newPageElement.Displayed;
            }
            catch (StaleElementReferenceException)
            {
                return false;
            }
        };
}


In the next and final part of this web scraping tutorial, we’ll turn our script into a shareable workbook application that any user with the QueryStorm Runtime can execute.