
Tuesday, April 9, 2019

How to Build a Web Scraper in Python


A web scraper is a piece of code, also known as a bot, used for gathering data from websites. You can build one with Python, a general-purpose programming language.

First off, why do we need a web scraper? Information is abundant on the internet these days. However, it is scattered across web pages and not easily gathered. It is hard to collect automatically because it is layered under various structures and levels of code. Essentially, the content of a web page is wrapped inside HTML tags or rendered by JavaScript.

The data we need are frequently unstructured, presented as free text. Even when they appear in tables, different websites use different table structures. If you want to extract the data, you need a tool that pulls out the right information, in other words, one that makes the data usable.


Web scraping is in high demand: many companies use it to study the market, understand trends, and support important decisions. Especially with the growth of machine learning, large amounts of data are required to get a real picture of the market.

Consequently, more programmers are interested in learning web scraping to fill the gap. Without a doubt, mastering this subject will enrich your skill set and make you more adaptable in the workforce.

There are several steps you must perform to build a web scraper with Python.

  • First of all, you have to determine at least one URL as your target page. 
  • Secondly, you must tell the bot in which part (HTML tag) of the page the data you want to collect is located. 
  • Lastly, your bot needs to parse the HTML code and sort out the data that you need. 
  • Most importantly, you must install Scrapy, because it isn't part of Python's standard library.



Things to Remember Before Scraping

Web scraping is fun: you can collect pretty much any information you want from the internet, such as prices, products for sale, locations, email addresses, phone numbers, images, you name it.

However, you must understand that not all website owners are happy when their pages are crawled and scraped. The bot you send to their server makes it busier and may slow down their website.

The best way to go about this is to let the site owner know your intentions. Tell them that you intend to scrape the website, why you need to do so, and what kind of information you will be extracting. This builds transparency and helps avoid conflicts.

You may also check whether the site provides an API (Application Programming Interface). If that is the case, you don't have to scrape the site at all: the data are available in a structured format, most commonly JSON. In most cases, you need a token to access the API, and some websites require a monthly subscription to gain access.
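As an illustration, here is a minimal sketch of calling such an API with the third-party requests library; the endpoint URL and token are placeholders, not a real service:

# A hypothetical API call instead of scraping (requires: pip install requests).
import requests

API_URL = "https://api.example.com/v1/products"   # placeholder endpoint
TOKEN = "your-api-token-here"                     # placeholder token from the provider

response = requests.get(API_URL, headers={"Authorization": f"Bearer {TOKEN}"})
response.raise_for_status()                       # stop early on an HTTP error

data = response.json()   # most APIs return JSON, ready to use without parsing HTML
print(data)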

In addition, websites change periodically. It could be just the look, but it could also be the entire system, rebuilt with a different programming language or framework. Your scraper bot might work well today, but tomorrow it could fail to collect data from the same site. That means you have to update your code repeatedly if you want to keep pulling data from that site.

Lastly, every web scraping project poses different challenges. This is mainly because target sites and the data you want to collect have a variety of structures and live in different HTML tags. Furthermore, scraping data from websites is becoming more demanding because more and more content is rendered with JavaScript.

In this article, I want to share the basic steps you need to follow when you want to scrape data from websites.

How to Build a Web Scraper in Python?


To be a successful web scraper, you need to be creative and adaptable to different data scraping scenarios. However, there are basic steps you need to follow to collect data from the internet. Follow the steps below to build a web scraper in Python.

Step 1: URL Finder

The first and most important step is to find the URL you wish to scrape. You do this with the help of a crawler, which browses websites for you and downloads the content for you to check.

To make things easier, try to answer these four “Wh” questions:

  • What kind of data is needed for your research?
  • What websites fall under the criteria?
  • Which website is more suitable for your needs?
  • Why do you need to scrape this particular website?

Once you have answered these questions and found your sites, you need to recheck whether they allow scraping. Any site's rules can be found in its robots.txt file, which is easy to access: simply add a forward slash (/) after the domain, followed by robots.txt. Then you can start your scraping process.
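A quick way to check those rules from Python is sketched below, using only the standard library; the domain is just a placeholder:

# Check a site's robots.txt before scraping (standard library only).
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")   # placeholder domain
parser.read()

# can_fetch() tells you whether a given user agent may request a given path
print(parser.can_fetch("MyScraperBot", "https://example.com/some-page"))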

Step 2: Identifying Which Elements to Scrape

As mentioned before, web scraping is used for a specific search. Hence, you need to figure out which HTML elements of the site hold the data you wish to scrape and determine which sections are relevant to you.


For this, you can use the 'Inspect Element' or 'View Source' option in Google Chrome, if you are a Chrome user, or even download the source code of the HTML page you are viewing.
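If you prefer to download the page source from Python, a minimal standard-library sketch looks like this (the URL is a placeholder):

# Download a page's HTML source to inspect the markup locally.
import urllib.request

url = "https://example.com"   # placeholder URL
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")

print(html[:500])   # print the first 500 characters to get a feel for the structure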

Step 3: Install Scrapy

This is where you start your scraping process. Scrapy is a Python scraping framework for extracting data easily. The best part about Scrapy is that it comes with built-in tools for both fetching and extracting data, so you don't need another supporting tool for either task.


For instance, if you use Beautiful Soup for extracting, you also need Requests or urllib to fetch the data for you. Scrapy was originally designed only for scraping, but with ongoing development of the framework it is now also a very powerful web crawler, so there is no need for a separate crawler.
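For comparison, a minimal Beautiful Soup sketch looks like this; it assumes both third-party packages are installed (pip install requests beautifulsoup4) and uses a placeholder URL:

# Fetch with Requests, parse with Beautiful Soup: two tools instead of one.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text    # Requests fetches the page
soup = BeautifulSoup(html, "html.parser")          # Beautiful Soup parses it

for heading in soup.find_all("h2"):                # e.g. pull every <h2> heading
    print(heading.get_text(strip=True))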


To install Scrapy, all you need to do is run the following command:

$ pip install scrapy 
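Once installed, you can confirm the version and, if you want a full project layout, scaffold one (the project name here is just an example):

$ scrapy version
$ scrapy startproject myscraper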


Step 4: Parsing the Code with Scrapy

Here you start with some actual coding. You will find various coding approaches online to continue with your web scraping process. The scraping code is really quite simple and easy to implement, and once you get the hang of it, scraping other websites becomes easier.
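As an illustration, here is a minimal spider targeting quotes.toscrape.com, a public practice site; the CSS selectors are specific to that page and would change for any other site:

# quotes_spider.py -- a minimal Scrapy spider
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Scrapy downloads each URL and hands the page to parse() as a response object
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Running it with scrapy runspider quotes_spider.py -o quotes.json writes the scraped items straight to a JSON file.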

Step 5: Sorting the Results

Now that you have fetched the information from an HTML tag, you need to sort out the core data you really need. For example, if you are gathering prices from different websites, your data will most likely contain a dollar sign or another currency symbol. You need to remove the currency symbol in order to store the data as an integer or float.
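A small sketch of that clean-up, assuming the raw value is a string like "$1,299.99":

# Strip the currency symbol and thousands separator, then convert to float.
import re

raw_price = "$1,299.99"                          # example scraped value
price = float(re.sub(r"[^\d.]", "", raw_price))  # keep only digits and the decimal point
print(price)                                     # 1299.99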

Another project may require you to collect data from the same HTML tag and then store it in different columns of a database. Perhaps you want to gather data about burglaries in a city over a one-year period. You've already crawled and scraped several local newspaper websites, and all the information you need is available in the news content. However, you need to split out separate details, such as the street name, suburb, and time of the incident, into different columns of the table.

So, you will have to process the data a bit before you can start using it. You must ensure that the data are in the right format and stored in the right table. This can be done with the help of regular expressions (regex).
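For example, here is a hedged sketch of pulling those separate fields out of free text with a regex, assuming the sentences follow a predictable pattern (real news text would need a more robust approach):

# Extract street, suburb and time from a sentence using named groups.
import re

sentence = "A burglary was reported on Main Street, Riverside at 11:30 pm."
pattern = r"on (?P<street>[\w ]+ Street), (?P<suburb>\w+) at (?P<time>[\d:]+ ?[ap]m)"

match = re.search(pattern, sentence)
if match:
    row = match.groupdict()   # {'street': 'Main Street', 'suburb': 'Riverside', 'time': '11:30 pm'}
    print(row)                # each key can become a column in the database table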

Once you have collected the data into your own database, you can use it as input to your data science project.

The more you practice, the better a web scraper you become. Try any random web scraping project: think of something you want to search for and need the data on, for instance, the best Netflix programs so far.

Other projects you could try include collecting data from social media (Twitter), visitors' comments or reviews, price tags from online marketplaces or hotels, or the names of cities, counties, and suburbs mentioned on news sites.

Conclusion

Scrapy is a powerful Python web scraping and crawling framework, and now you know the main steps to build a web scraper. To become proficient in this subject, you must practice with various data mining projects.

You can start learning web scraping by following the numerous tutorials available online. A good book to start with is Learning Scrapy, which sits between the simplistic, inefficient tutorials on the web and the complex textbooks available on the market.

