Scrape Your Way To the Interviews — Part 2

Maria Vasilenko
3 min read · Sep 18, 2020

In this article, we will continue building our Glassdoor interview reviews scraper (check out the beginning here) and extend the code to collect reviews for multiple companies and to add logging.

First off, let’s create a .json file listing the companies (lower case, please!) whose interview reviews you want to collect.

{"companies": "microsoft, google, facebook, amazon, apple"
}

Now we only need to read the company names from the companies.json file and create a loop that collects interview reviews for each company, as we did in Part 1.

import json

companies_list = []
with open('companies.json') as f:
    companies = json.load(f)

for comp in companies['companies'].split(','):
    companies_list.append(comp.strip())  # drop the space after each comma
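
From there, the collection loop simply iterates over companies_list and runs the Part 1 scraping steps for each name. A minimal sketch of that driver loop (scrape_company_reviews is a hypothetical placeholder for the Part 1 logic, not a function from the actual repository):

# Hypothetical driver loop: scrape_company_reviews stands in for the
# Part 1 steps (log in, search for the company, collect the reviews).
all_reviews = {}
for company in companies_list:
    all_reviews[company] = scrape_company_reviews(company)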

As you might have noticed already, there are many moving parts in this project, and many ways in which our program might fail. To keep track of problems and get warnings in time, I use logging a lot!

If you haven’t used it before, I highly recommend starting! The documentation has a great intro tutorial and example code. In a nutshell, logging lets you catch all sorts of problems that arise while the program is running. You can display the messages in the terminal, save them to a file, send them to a specified e-mail address or to HTTP GET/POST endpoints, among other options. There are five primary levels of logging, each differing in the severity of the events it tracks:

  • CRITICAL (tracks only the most serious errors)
  • ERROR
  • WARNING (the default; captures unexpected events while the code is still running)
  • INFO
  • DEBUG (the most detailed level; captures messages from all of the levels above).
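
As a quick, self-contained illustration (not taken from the scraper itself), here is how the level threshold filters messages when using the standard library’s basicConfig:

import logging

# With the threshold set to INFO, DEBUG messages are filtered out,
# while INFO and everything more severe gets through.
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

logging.debug('Hidden: below the INFO threshold')
logging.info('Shown: a progress message')
logging.warning('Shown: something unexpected happened')
logging.error('Shown: a failure occurred')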

The way I use logging in my scraper is to create a Logger instance and call .info() whenever I want to see the progress, or .error() when there is a failure.

Let’s consider a piece of our scraper program where we check whether we landed on the company page.

Code snippet for checking if the scraper reached the company page

If the scraper can get the URL of the current page, i.e., it has landed on the company page, we will see an INFO alert in the terminal with our custom message “Current company URL:…”. However, if the scraper fails to reach the company page, we will see the custom error alert “Not able to find the company URL.”
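
A minimal sketch of what such a check can look like, assuming a Selenium-style driver object that exposes the page URL via current_url (the names here are illustrative; the exact snippet is in the linked repository):

import logging

logger = logging.getLogger(__name__)

def check_company_page(driver):
    """Log the current URL if the company page was reached."""
    try:
        current_url = driver.current_url
        logger.info('Current company URL: %s', current_url)
        return current_url
    except Exception:
        logger.error('Not able to find the company URL.')
        return None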

Another nice feature of logging is that you can create separate loggers for different modules of your program. That way, each log message refers to the particular module it originated from.

For example, I organized my code (see my Glassdoor web scraper code here) as follows.

GLASSDOOR SCRAPER
├── main.py
├── parse_utils.py
└── requirements.txt

The file parse_utils.py contains all of the functions needed to get the work done: logging into an account, entering the location and company name, getting reviews, and so on. main.py is where we call these functions and perform the data collection.

For example, here is how I create the logger in the main.py module:

# main.py logger instance
import logging.config

logging.config.fileConfig('logging.conf')
logger = logging.getLogger('root')

Similarly, I create the logger instance in the parse_utils.py module:

import logging.config

logging.config.fileConfig('logging.conf')
logger = logging.getLogger('parser')
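
Both modules load their configuration from a logging.conf file via fileConfig. A minimal sketch of a configuration that would support the 'root' and 'parser' loggers above (the actual file in the repository may differ):

# logging.conf (sketch)
[loggers]
keys=root,parser

[handlers]
keys=consoleHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=INFO
handlers=consoleHandler

[logger_parser]
level=INFO
handlers=consoleHandler
qualname=parser
propagate=0

[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=simpleFormatter
args=(sys.stdout,)

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s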

Now, if we run a piece of main.py code that calls the function get_reviews() from the parse_utils.py module, we will see logger messages coming from the different modules!

Running the scraping script prints messages from both loggers in the terminal.
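
A hedged sketch of that interaction (the argument to get_reviews() is a placeholder; its real signature is in the repository):

# main.py (sketch)
import logging.config
import parse_utils

logging.config.fileConfig('logging.conf')
logger = logging.getLogger('root')

logger.info('Starting review collection')       # emitted by the 'root' logger
reviews = parse_utils.get_reviews('microsoft')  # parse_utils logs via 'parser'

# Illustrative terminal output (exact format depends on logging.conf):
# 2020-09-18 10:15:02 - root - INFO - Starting review collection
# 2020-09-18 10:15:07 - parser - INFO - Current company URL: ...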

To sum up, logging helps you track the progress of your running software and turns out to be a potent tool when collecting data programmatically.

That concludes my beginner’s tutorial on building your own web scraper for Glassdoor interview reviews. Hooray, we can now collect thousands of them!

However, what is the best way to store and access those reviews? Locally or on the server? It’s a great practice to have your data organized and accessible from anywhere. In the next article, we will learn how to build an API that allows us to store and access the data on Heroku.

Stay tuned!

If you found this article helpful, please spread the word by liking, sharing, or leaving a comment.

In the meantime, you can check out my other articles here.
