In part 1 of this blog post series, we covered the most common approach to web scraping and its issues. We also walked through a small example of how to start web scraping with C#, Selenium, and QueryStorm in Excel. Now we'll expand on that example and create a more useful web scraper.
Navigating to and scraping paginated items
It's time to kick the web scraping up a notch. This time, let's scrape the names and prices of the top items on the home page, then navigate to the laptops category and scrape all of the laptops as well.
Preparing the table
We should delete the current table rows, as they are irrelevant. Instead of deleting them by hand, we can call `ResultsTable.Clear()` to remove all current table entries, as shown in the sketch below. In addition, we should also edit the `ResultsTable` by renaming the `Results` column to `Product Name` and by adding a new column named `Price`.
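For reference, here's the cleanup call in the QueryStorm script (a minimal sketch – `ResultsTable` is the table binding from part 1, and the column renaming and adding is done by hand in Excel):

```csharp
// Remove all existing rows from the bound Excel table in one call,
// instead of deleting them manually in the sheet
ResultsTable.Clear();
```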
Getting the price
To get the price along with the title of the top items, our script needs only minor modifications. First, we find all of the items by their CSS selector (`div.thumbnail`). Then, inside each parent item element, we find the name and the price by locating their respective elements (name CSS selector: `h4 > a`; price CSS selector: `div.caption > h4.pull-right.price`).
```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");

    var topItems = driver.FindElements(By.CssSelector("div.thumbnail"));
    topItems.ForEach(i =>
        ResultsTable.AddRow(r =>
        {
            r.Product_Name = i.FindElement(By.CssSelector("h4 > a")).GetAttribute("title");
            r.Price = i.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
        })
    );
}
```
Preventing CSS selector issues
Just a heads up – without any changes to the default driver initializer, the browser opens as a small window. That means there's a chance the page will render with a mobile/tablet layout, making the CSS selectors you copied from the DevTools of a maximized browser window invalid. To prevent this issue, we start the driver with options specifying that the browser should start maximized.
```csharp
// Maximize browser
ChromeOptions options = new ChromeOptions();
options.AddArgument("--start-maximized");

using (IWebDriver driver = new ChromeDriver(options))
{
    // ...
}
```
Page navigation
The next step is navigating – first to the Computers page and then to the Laptops page.
We can navigate by clicking on the Computers menu item and waiting for the Computers page to load. Subsequently, we should click on the Laptops menu item and wait for the Laptops page to load.
Note: We could just navigate to the URL https://webscraper.io/test-sites/e-commerce/ajax/computers/laptops instead of clicking on the side menu items, but I feel it’s better to demonstrate how to click and wait for the page to load as it is a pretty common problem in web scraping.
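For completeness, the direct-URL alternative really is a one-liner:

```csharp
// Alternative: load the Laptops page directly instead of clicking through the side menu
driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax/computers/laptops");
```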
Clicking the button
Clicking is easy – we find the element and call its `Click` method.
driver.FindElement(By.CssSelector("#side-menu > li.active > ul > li:nth-child(1) > a")).Click();
Waiting for a page to load
Waiting itself is not an issue – the `WebDriverWait` class lets us wait, up to a given timeout, until an arbitrary condition is met. The tricky part is the condition itself.
```csharp
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(...);
```
In our case, the condition is that the new page has loaded. To check it, we need to determine exactly when the old page has unloaded and the new page has loaded. The most robust way to achieve this is to wait for an element on the old page to go "stale" (no longer attached to the DOM), and for an element on the new page to be displayed.
As a sort of helping hand, we could install the `DotNetSeleniumExtras.WaitHelpers` NuGet package, which ships ready-made conditions for exactly this kind of check. However, the project is no longer maintained, and the relevant code isn't complicated, so we'll write the conditions ourselves.
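For the curious, here's roughly what the package-based approach would look like – a sketch only, assuming the package's `SeleniumExtras.WaitHelpers.ExpectedConditions` class and a `wait`, `oldPageElement` and `newPageElementLocator` like the ones we use below:

```csharp
using SeleniumExtras.WaitHelpers;

// Wait for the old page's element to detach from the DOM,
// then for the new page's element to become visible
wait.Until(ExpectedConditions.StalenessOf(oldPageElement));
wait.Until(ExpectedConditions.ElementIsVisible(newPageElementLocator));
```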
The `WebDriverWait` class's `Until` method takes a parameter of type `Func<IWebDriver, TResult>`. Therefore, we'll create a `NewPageLoaded` method that returns such a delegate for `Until` to evaluate. The code can look something like this:
```csharp
private Func<IWebDriver, bool> NewPageLoaded(IWebElement oldPageElement, By newPageElementLocator)
{
    return (driver) =>
    {
        bool oldElementStale = ...;
        bool newElementVisible = ...;
        return oldElementStale && newElementVisible;
    };
}
```
To complete the `NewPageLoaded` method, we need to replace the dots with concrete staleness and visibility checks. These checks will also return delegates, so they can be used both as regular methods and by the `Until` method. So, let's define the methods that check for staleness and visibility.
Element staleness
An element is stale if any of these conditions are met:
- The element is disabled
- The element is missing (null)
- Accessing the element throws a `StaleElementReferenceException`
```csharp
private Func<IWebDriver, bool> IsElementStale(IWebElement element)
{
    return (driver) =>
    {
        try
        {
            // Accessing Enabled on an element that's no longer attached
            // to the DOM throws a StaleElementReferenceException
            return element == null || !element.Enabled;
        }
        catch (StaleElementReferenceException)
        {
            return true;
        }
    };
}
```
Element visibility
Conversely, an element is visible if:
- The driver can find the element
- The element is displayed
```csharp
private Func<IWebDriver, bool> IsElementVisible(By elementLocator)
{
    return (driver) =>
    {
        try
        {
            var newPageElement = driver.FindElement(elementLocator);
            return newPageElement.Displayed;
        }
        catch (NoSuchElementException)
        {
            // The element isn't in the DOM yet
            return false;
        }
        catch (StaleElementReferenceException)
        {
            // The element was removed between finding it and checking Displayed
            return false;
        }
    };
}
```
Page loaded condition and navigation
Finally, the complete `NewPageLoaded` method looks like this:
```csharp
private Func<IWebDriver, bool> NewPageLoaded(IWebElement oldPageElement, By newPageElementLocator)
{
    return (driver) =>
    {
        bool oldElementStale = IsElementStale(oldPageElement).Invoke(driver);
        bool newElementVisible = IsElementVisible(newPageElementLocator).Invoke(driver);
        return oldElementStale && newElementVisible;
    };
}
```
Once we decide which elements on each page we'll use to detect that a new page has loaded, we're ready to navigate to the Computers page and the Laptops page. I chose the following:
| Page | Element | CSS selector |
|------|---------|--------------|
| Home page | `h1` | `div.jumbotron > h1` |
| Computers page | `h2` | `div.col-md-9 > h2` |
| Laptops page | `div class="btn-group pagination"` | `div.btn-group.pagination` |
Now we can finally perform navigation:
```csharp
// Create WebDriverWait instance
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

// Navigate to Computers page and wait until it's loaded
var homePageElem = driver.FindElement(By.CssSelector("div.jumbotron > h1"));
driver.FindElement(By.CssSelector("#side-menu > li:nth-child(2)")).Click();
wait.Until(NewPageLoaded(homePageElem, By.CssSelector("div.col-md-9 > h2")));

// Navigate to Laptops page and wait until it's loaded
var computerPageElem = driver.FindElement(By.CssSelector("div.col-md-9 > h2"));
driver.FindElement(By.CssSelector("#side-menu > li.active > ul > li:nth-child(1) > a")).Click();
wait.Until(NewPageLoaded(computerPageElem, By.CssSelector("div.btn-group.pagination")));
```
Scraping paginated laptop items
Since we’ve navigated to the Laptops page, we can now scrape the laptop items. We need a couple of things to do that.
First of all, we need a reference to the "Next" button element – by clicking it we can load the items page by page (button CSS selector: `button.btn.btn-default.next`).
The second thing to keep in mind is that we have to wait until the next page of items has loaded. Luckily, we've already made a method to check the staleness of elements, so we can infer that a new page of items has loaded once the laptop items from the current page go stale.
And lastly, we should check whether the "Next" button is enabled or disabled, so we know whether we've reached the last page of items.
```csharp
// Scrape all laptops (with pagination)
var nextPageBtn = driver.FindElement(By.CssSelector("button.btn.btn-default.next"));
bool isLastPage = false;
do
{
    isLastPage = !nextPageBtn.Enabled;

    var laptopItems = driver.FindElements(By.CssSelector("div.thumbnail"));
    laptopItems.ForEach(i =>
        ResultsTable.AddRow(r =>
        {
            r.Product_Name = i.FindElement(By.CssSelector("h4 > a")).GetAttribute("title");
            r.Price = i.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
        })
    );

    if (!isLastPage)
    {
        nextPageBtn.Click();
        wait.Until(IsElementStale(laptopItems.First()));
    }
} while (!isLastPage);
```
We are almost done with our scraper! Let's run the script with F5 and wait a couple of seconds. As a result, we should see 120 scraped products in our table. However, there's one more thing to do – refactor the code a bit.
Finishing steps
First of all, the code for saving home items and laptop items is the same. Therefore, we can extract a method for saving items.
We can also extract a method for page navigation. We just need to pass different CSS selectors when calling the method.
And lastly, to keep the main part of the script nice and readable, we can do two things: create a new method just for scraping the laptop items, and create a new method for the navigation to the Laptops page.
We’re done!
Finally, here’s the full code for the tutorial:
```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Clear old results
ResultsTable.Clear();

// Maximize browser
ChromeOptions options = new ChromeOptions();
options.AddArgument("--start-maximized");

using (IWebDriver driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");

    // Find top items and save them to the ResultsTable
    var topItems = driver.FindElements(By.CssSelector("div.thumbnail"));
    SaveItems(topItems);

    // Create WebDriverWait instance
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

    // Navigate to Laptops page and wait until it's loaded
    NavigateToLaptops(driver, wait);

    // Scrape all laptops (with pagination)
    ScrapeLaptops(driver, wait);
}

private void ScrapeLaptops(IWebDriver driver, WebDriverWait wait)
{
    var nextPageBtn = driver.FindElement(By.CssSelector("button.btn.btn-default.next"));
    bool isLastPage = false;
    do
    {
        isLastPage = !nextPageBtn.Enabled;

        var laptopItems = driver.FindElements(By.CssSelector("div.thumbnail"));
        SaveItems(laptopItems);

        if (!isLastPage)
        {
            nextPageBtn.Click();
            wait.Until(IsElementStale(laptopItems.First()));
        }
    } while (!isLastPage);
}

private void NavigateToLaptops(IWebDriver driver, WebDriverWait wait)
{
    // Navigate to Computers page and wait until it's loaded
    Navigate(driver, wait,
        By.CssSelector("div.jumbotron > h1"),
        By.CssSelector("#side-menu > li:nth-child(2)"),
        By.CssSelector("div.col-md-9 > h2"));

    // Navigate to Laptops page and wait until it's loaded
    Navigate(driver, wait,
        By.CssSelector("div.col-md-9 > h2"),
        By.CssSelector("#side-menu > li.active > ul > li:nth-child(1) > a"),
        By.CssSelector("div.btn-group.pagination"));
}

private void Navigate(IWebDriver driver, WebDriverWait wait, By oldPageElementLocator, By elementToClick, By newPageElementLocator)
{
    var oldPageElem = driver.FindElement(oldPageElementLocator);
    driver.FindElement(elementToClick).Click();
    wait.Until(NewPageLoaded(oldPageElem, newPageElementLocator));
}

private void SaveItems(IReadOnlyCollection<IWebElement> items)
{
    items.ForEach(i =>
        ResultsTable.AddRow(r =>
        {
            r.Product_Name = i.FindElement(By.CssSelector("h4 > a")).GetAttribute("title");
            r.Price = i.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
        })
    );
}

private Func<IWebDriver, bool> NewPageLoaded(IWebElement oldPageElement, By newPageElementLocator)
{
    return (driver) =>
    {
        bool oldElementStale = IsElementStale(oldPageElement).Invoke(driver);
        bool newElementVisible = IsElementVisible(newPageElementLocator).Invoke(driver);
        return oldElementStale && newElementVisible;
    };
}

private Func<IWebDriver, bool> IsElementStale(IWebElement element)
{
    return (driver) =>
    {
        try
        {
            return element == null || !element.Enabled;
        }
        catch (StaleElementReferenceException)
        {
            return true;
        }
    };
}

private Func<IWebDriver, bool> IsElementVisible(By elementLocator)
{
    return (driver) =>
    {
        try
        {
            var newPageElement = driver.FindElement(elementLocator);
            return newPageElement.Displayed;
        }
        catch (NoSuchElementException)
        {
            return false;
        }
        catch (StaleElementReferenceException)
        {
            return false;
        }
    };
}
```
In the next and final part of this web scraping tutorial, we’ll turn our script into a shareable workbook-application that any user with the QueryStorm Runtime can execute.