Introduction
Web scraping is something that will come up sooner or later in a developer’s career – be it for a business requirement or for personal use. So, how do we tackle web scraping when the time comes?
First, let’s split the scraping task into three subtasks:
- Getting the data
- Parsing the data to get our results
- Outputting the results (most commonly to an Excel spreadsheet or CSV file)
Based on the subtasks, the most obvious choice would be to use Python for web scraping because:
- It’s easy to set up for a developer
- There are plenty of well-documented libraries (and even full frameworks, such as Scrapy) to help us with each subtask:
  - Getting the data – Requests for static websites, or Selenium with Python for dynamic websites (the preferred option)
  - Parsing the data – Beautiful Soup
  - Outputting the results – csv, pandas or xlsxwriter
- As a result, we can accomplish our goal without writing much code
Wait, why not Python?
There are a couple of issues that come with the Python approach.
Validating output
Validating the output of our Python script can prove to be a time-consuming task. Every time we run the script, we have to open the generated output file to check whether it meets our requirements.
This isn’t an overwhelming issue if we only have to scrape a single website with simple markup, where there isn’t much room for error when writing the script. However, if we have to scrape multiple complex sites (pagination, AJAX, data that needs heavy reformatting before it matches what we want), we suddenly lose a lot of time just validating the results in the generated file(s).
Sharing scraped data
The other issue that comes to mind is sharing the data with colleagues/friends that are not developers.
An obvious solution would be to send a fresh copy of the data every time you run the script. However, this approach clashes with the developer mindset – we want to automate tasks as much as possible.
An alternative would be to set up Python on their machine, send them the script and explain how to execute it. That means you’d have to set aside a fair amount of time (especially if you have to do it more than once), or worse, your colleagues/friends might refuse to set up Python or learn what they need to do to get the data on their own.
This poses the question: Is there an alternative way in which we could leverage something as useful as Selenium for web scraping, while at the same time overcome the aforementioned issues?
The answer is yes: QueryStorm lets you use C# inside Excel, which solves both issues, and there is a Selenium NuGet package we can use for scraping. As a bonus, you don’t have to write any code to output the results to a CSV/XLSX file.
How does QueryStorm solve these issues?
Since QueryStorm lets you write and run C# directly inside Excel, you don’t have to keep re-opening output files to validate them – the results sit in the same window as the QueryStorm IDE.
Also, the C# code is part of the Excel workbook, so sharing is easy – just share the workbook file. The only setup the recipient needs is to install the QueryStorm runtime via a simple and small (4 MB) installer, and they’re ready to run the scraper on their own.
Enough talk, let’s get started and see how we would accomplish this!
Prerequisites and creating a workbook project
Of course, you need QueryStorm to follow this tutorial. You can generate a trial key and download (and install) the full version from the downloads page.
As with every “web scraping with Selenium” tutorial, you have to download the appropriate driver to interface with the browser you’re going to use for scraping. Since we’re using Chrome, download the driver for the version of Chrome you’re using. The next step is to add it to the system path. For more details check the official Selenium documentation. (Note: If you’re sharing the scraper, the recipient must also download the driver and add it to the system path.)
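If editing the system path is a hassle (for you or for the people you share the workbook with), the ChromeDriver constructor also accepts the directory that contains the driver executable, so you can point at the driver directly from code. Here’s a minimal sketch, with a hypothetical driver folder:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Hypothetical folder that contains chromedriver.exe – adjust to wherever you unpacked the driver
var driverDirectory = @"C:\WebDrivers";

using (IWebDriver driver = new ChromeDriver(driverDirectory))
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");
}
```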
Open up Excel with a blank workbook, select the QueryStorm tab and click on the C# script button.
This will create a new workbook project and open it in the QueryStorm IDE.
Adding the Selenium NuGet package
To work with Selenium, we have to add the Selenium NuGet package to the project. Open the package manager either by clicking on the Manage packages button in the ribbon or by right clicking the project in the IDE and selecting the Manage packages context menu item.
Next, select the NuGet Packages tab item and search for the Selenium.WebDriver package. Now we can install the package by pressing the blue install package button. Wait for the package to be installed and exit the Package manager.
Navigating to a URL with Selenium
Now we’re ready to write some code. Let’s start off by creating an instance of a Chrome WebDriver (the driver is an IDisposable object, so it should be instantiated in a using statement) and navigating to a URL (I’ll be using this scraping test site). Additionally, let’s wait for 5 seconds before the browser is closed.
```csharp
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");
    Thread.Sleep(5000);
}
```
You can run the script by pressing the Run button in the ribbon or by pressing F5.
If you’ve downloaded the driver and correctly added its directory to the system path, this example should have opened a Chrome window, navigated to the specified URL and closed it after 5 seconds.
Scraping item names
Let’s say that instead of waiting for 5 seconds and closing the browser, we want to get the names of the top items being scraped right now from the example homepage. In addition, we want to save them to the current workbook. How do we do that?
First, create a table in the spreadsheet (select a cell and press CTRL + T), name it ResultsTable and name the column Results.
Now return to the C# script and start typing ResultsTable. You’ll see the table appear in the code completion window, and we can use it to save our results. How cool is that?
To get the complete names of the items, we first have to find the item elements with the driver’s FindElements method, supplying a CSS selector that matches the items. You can find the selector with the help of Chrome’s DevTools (it’s h4 > a). Finally, since the visible link text is shortened, we get the complete item name from the element’s title attribute (with IWebElement’s GetAttribute method).
```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");

    var items = driver.FindElements(By.CssSelector("h4 > a"));
    items.ForEach(i => ResultsTable.AddRow(r => r.Results = i.GetAttribute("title")));
}
```
Now when we run the script, we can see the results in our ResultsTable! We’ve managed to scrape useful data in just a couple of lines of code.
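One caveat: this page loads its content via AJAX, so FindElements could in principle run before the items have rendered. If you ever end up with an empty table, an explicit wait is more reliable than a fixed Thread.Sleep. Below is a rough sketch using WebDriverWait (depending on your Selenium version, this class may live in the separate Selenium.Support NuGet package):

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI; // may require the Selenium.Support package on older Selenium versions

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://webscraper.io/test-sites/e-commerce/ajax");

    // Wait up to 10 seconds for at least one item link to appear before reading the results
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    var items = wait.Until(d =>
    {
        var found = d.FindElements(By.CssSelector("h4 > a"));
        return found.Count > 0 ? found : null; // null tells the wait to keep polling
    });

    foreach (var item in items)
        ResultsTable.AddRow(r => r.Results = item.GetAttribute("title"));
}
```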
We’ll expand on this example and scrape more data in the next part of the tutorial.