Scraping Clutch with Go
Let’s create a new file in our working directory and call it clutch.go. We will use Colly for the scraping job; it is a well-written framework, and we recommend reading its documentation. You can install Colly with a single command in your terminal or command prompt (for example, go get github.com/gocolly/colly). The download might take a moment.
- Install Colly
Now, open the file (clutch.go) in your favourite IDE. We begin by specifying the package and then writing the main function.
Don’t forget to run the code and verify that everything works as expected.
The first thing we need to declare inside the function is a filename.
Up next, we will create a file since we have a filename.
It will create a file titled london_digital_agencies.csv; now, run the code and check for errors.
How do we catch errors? The call that creates the file also returns an error value, so let’s check it in our code.
If the error is not nil, log.Fatalf() prints the message and exits the program.
The next thing we need to do is close the file.
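Here’s where defer is very helpful. A deferred call does not run right away; it runs when the surrounding function returns. Amazing, right? Si, si; it means we don’t have to remember to close the file manually at every exit point.
Go itself does not add imports for you, but an editor that runs goimports or gopls on save will insert the necessary import lines; otherwise, add them to the import block by hand.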
Let’s progress our code a bit. What you need next is a CSV writer. Why? We need to write the data we fetch from clutch.co to a CSV file.
Adding a writer brings in another package, “encoding/csv”; an editor with Go tooling will add that import automatically. Pretty neat, right?
After writing our data, we need to push everything still held in the writer’s buffer out to the file. For this, we use Flush.
Because we want the flush to happen at the end and not right away, we add the keyword defer. Finally, we have a well-structured file and a writer ready to go. It is time to get our hands dirty and start the web scraping job. To begin, we need to instantiate a collector.
An editor with Go tooling will have added the colly import for us. The next thing on our to-do list is to restrict the collector to the domain we want to extract data from. We will scrape a list of digital agencies providing services in London, United Kingdom, from Clutch.
Clutch is the leading ratings and reviews platform for IT, Marketing and Business service providers. Each month, over half a million buyers and sellers of services use the Clutch platform, and the user base is growing over 50% a year.
The next thing we need to do is point to the web page from where we will fetch the data. Here is how we are going to do it.
We will fetch data from this page.
We are interested in collecting “name”, “logo”, “rating”, “tagline”, “locality”, and “clutch_profile.” After inspecting the page, we discovered provider-info is our target tag.
The callback receives a pointer to each HTML element that matches provider-info. With this callback, we write the data into our CSV file. The writer’s Write method takes a slice of strings, so we need to specify precisely which values go into it. ChildText returns the concatenated and stripped text of the matching child elements. Inside it, we pass the selector a to extract the text of all a elements, and so on for the other fields. The values are separated by commas in the slice, and the CSV writer puts each one in its own column. To get the logos, we also need the img element; since an img has no text content, its URL comes from the src attribute (ChildAttr) and not from ChildText.
Phew, all done! It is time to build our clutch scraper.
It generates a Unix executable file named clutch. You can execute it by running ./clutch in your terminal.
- Scraping Job Finishes
Finally, you can look inside the file london_digital_agencies.csv that our program created and preview the collected data.
We decided to refine our code a bit. Now, it includes the URL of each digital agency in London. In addition, there’s a condition that retrieves each agency’s logo from the “src” attribute whether or not the image uses the performance optimization strategy known as lazy loading. The latest code is available here.
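The refined code is not reproduced here, but one way such a logo condition might work is sketched below. It assumes that lazy-loaded images keep their real URL in a separate attribute such as data-src while src holds a placeholder; that attribute name and the helper logoURL are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// logoURL is a hypothetical helper: it prefers the lazy-loading attribute
// value when one is present and falls back to the plain src otherwise.
func logoURL(src, lazySrc string) string {
	if lazy := strings.TrimSpace(lazySrc); lazy != "" {
		return lazy
	}
	return strings.TrimSpace(src)
}

func main() {
	// Eagerly loaded image: only src is set.
	fmt.Println(logoURL("https://example.com/logo.png", ""))
	// Lazily loaded image: src is a placeholder, the real URL is elsewhere.
	fmt.Println(logoURL("placeholder.gif", "https://example.com/real-logo.png"))
}
```

Inside a colly callback, the two arguments could come from e.ChildAttr("img", "src") and e.ChildAttr("img", "data-src").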
Contact us; we can help you create advanced and complex web scrapers that’ll crawl thousands and even millions of pages. The data can be delivered in multiple formats (.csv and .json), or we can automatically send it to cloud storage like Amazon S3, Azure Blob Storage, and Google Cloud Storage. Want to ingest the data into a database of your choice? We’ve got you covered.