How can we scrape a single website? In this case, we don't want to follow any links; I will describe following links in another blog post.
I am using Jupyter Notebook because it is more convenient to have live output of the code, and it also offers many other features. You could work with any editor or IDE of your choice, but keep in mind that we are working with Python 3 and some of the libraries may differ in Python 2. If you are using the Anaconda distribution of Python, Jupyter Notebook is already included. To install it in a standard Python distribution, use pip: pip3 install jupyter. This installs the entire Jupyter system, including the notebook, QtConsole, and the IPython kernel; it is also possible to install the notebook alone.
First of all, we will use Scrapy running in Jupyter Notebook. Unfortunately, there is a problem with running Scrapy multiple times in Jupyter. I have not found a solution yet, so let's assume for now that we can run a CrawlerProcess only once.
In the first step, we need to define a Scrapy Spider. It consists of two essential parts: the start URLs (a list of pages to scrape) and the selector (or selectors) that extract the interesting part of a page. In this example, we are going to extract Marilyn Manson's quotes from Wikiquote.
Let's look at the source code of the page. The content is inside a div with the 'mw-parser-output' class. Every quote is in an 'li' element. We can extract them using a CSS selector.
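Putting those two parts together, a minimal sketch of the spider and of running it with a CrawlerProcess could look like this (the class name, the item key, and the exact Wikiquote URL are my assumptions, not necessarily the original code):

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = 'wikiquote'  # hypothetical spider name
    # The single page to scrape; we do not follow any links from it.
    start_urls = ['https://en.wikiquote.org/wiki/Marilyn_Manson']

    def parse(self, response):
        # Select every <li> element inside the div with the 'mw-parser-output' class.
        for quote in response.css('div.mw-parser-output li'):
            yield {'quote': quote.get()}

# In Jupyter, remember: a CrawlerProcess can be started only once per kernel.
process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
```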
What do we see in the log output? Things like this:
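Roughly, items of this shape, where the text below is only a placeholder standing in for the raw HTML of a whole li element (the quote plus its nested source):

```python
{'quote': '<li>The quote text itself ...\n<ul><li><i>the source of the quote</i></li></ul></li>'}
```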
It is not a perfect output. I don't want the source of the quote or the HTML tags. Let's do it in the most trivial way, because this is not a blog post about extracting text from HTML: I am going to split the quote into lines, select the first one, and remove the HTML tags.
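A possible sketch of that cleanup, using the remove_tags helper from w3lib (a library that ships with Scrapy); the function name here is mine:

```python
from w3lib.html import remove_tags

def extract_first_line(raw_quote_html):
    # Keep only the first line (the quote itself, without the nested source)
    # and strip the remaining HTML tags.
    first_line = raw_quote_html.split('\n')[0]
    return remove_tags(first_line)
```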
The proper way to process the extracted content in Scrapy is to use an item pipeline. As input, the pipeline gets the item produced by the scraper, and it must produce output in the same format (for example, a dictionary).
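A minimal sketch of such a pipeline, applying the same cleanup as above (the class name is hypothetical):

```python
from w3lib.html import remove_tags

class ExtractFirstLine:
    def process_item(self, item, spider):
        # Scrapy calls this method for every scraped item.
        first_line = item['quote'].split('\n')[0]
        return {'quote': remove_tags(first_line)}
```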
Adding the pipeline is easy: it is just part of the spider's custom_settings.
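Assuming the pipeline class is defined at the top level of the notebook (so Scrapy can find it as __main__.ExtractFirstLine, which is my assumption), the relevant part of the spider class could look like this:

```python
custom_settings = {
    # Inside the spider class: register the pipeline defined above.
    # The integer value is explained below.
    'ITEM_PIPELINES': {'__main__.ExtractFirstLine': 100},
}
```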
There is one strange part of the configuration. What is the integer in the dictionary? What does it do? According to the documentation: 'The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes.'
What does the output look like after adding the processing pipeline item?
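After the pipeline runs, each item is reduced to plain text; something along these lines, again with placeholder text instead of a real quote:

```python
{'quote': 'The quote text itself, as plain text, without the source.'}
```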
Much better, isn't it?
I want to store the quotes in a CSV file. We can do it using a custom configuration. We need to define a feed format and the output file name.
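Assuming the older-style FEED_FORMAT and FEED_URI settings, and with an output file name of my choosing, the custom_settings grow to:

```python
custom_settings = {
    'ITEM_PIPELINES': {'__main__.ExtractFirstLine': 100},
    # Feed export: write every scraped item to a CSV file.
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'quotes.csv',  # hypothetical output file name
}
```

(Newer Scrapy releases replace these two settings with a single FEEDS dictionary.)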
There is one annoying thing. Scrapy logs a vast amount of information.
Fortunately, it is possible to define the log level in the settings too. We must add this line to custom_settings:
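Assuming we only care about warnings and errors, the line could be:

```python
'LOG_LEVEL': logging.WARNING,
```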
Remember to import logging!