0. Getting Started With Scrapy

Scrapy at a glance
What is Scrapy?
Scrapy is a Python framework for crawling websites and extracting data from their pages.
Scrapy is easy to pick up, though using it well takes a few tricks and some logic, and it tends to grow on anyone who tries it.
Uses of Scrapy: automated testing, web scraping, data mining, and information/text processing.
A Moral Story:
A web designer built a website in 6 months and then filled it with data, which cost $10,000 or more and took 2-3 years of effort. His business was doing well, and his website held millions of records.
As time passed, competition became tough, and many new websites launched within a short time. He could not understand how the other websites were getting data like his. Were they spending a lot of money to collect it?
What do you think?
In my view, no: you don't need to spend money on data if you know how to use Scrapy. Data can be captured from other websites and stored in a database, CSV files, or JSON files.
If that caught your interest, read on.

Installing Scrapy and its supporting packages:
//Note: All the packages are available as .exe installers for 32-bit Python 2.7, and the links given below may change at any time.//
These instructions are for the Windows operating system only.
1. Install the 32-bit version of Python 2.7.
Python 2.7 32-bit is a stable version that works without version conflicts with the other Python packages used here. Don't install the 64-bit version; it may cause problems later.
Link to the Python 32-bit Windows MSI file: http://www.python.org/ftp/python/2.7/python-2.7.msi

Locate the downloaded Python MSI file and double-click it to install.

2. Install easy_install, which lets you install the other supporting Python packages without problems.

3.       Add the C:\python27\Scripts and C:\python27 folders to the system path by adding those directories to the PATH environment variable from the Control Panel.
Start -> search for "variable" -> edit the system path variable


Edit the system path so that it includes both folders and click OK on every dialog. Now open CMD by typing cmd into the Run box. Type python in CMD and press Enter. If you see the Python interactive shell, everything is set up correctly.
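If the python command is not recognized, the usual cause is a missing PATH entry. The following is a minimal sketch, not part of the original steps, that checks whether the two folders from step 3 appear in a PATH value. The helper name missing_from_path is hypothetical, and the folder names assume the default C:\python27 install location.

```python
from __future__ import print_function
import os


def missing_from_path(required_dirs, path_value, sep=";"):
    """Return the entries of required_dirs that do not appear in path_value."""
    # Windows paths are case-insensitive, so compare lower-cased entries
    # with any trailing backslash removed.
    def normalize(entry):
        return entry.strip().rstrip("\\").lower()

    present = set(normalize(p) for p in path_value.split(sep) if p.strip())
    return [d for d in required_dirs if normalize(d) not in present]


if __name__ == "__main__":
    # Assumed default install folders from step 3.
    required = [r"C:\python27", r"C:\python27\Scripts"]
    missing = missing_from_path(required, os.environ.get("PATH", ""))
    if missing:
        print("Not on PATH yet:", ", ".join(missing))
    else:
        print("Both Python folders are on PATH.")
```

Save it as a .py file and run it from a freshly opened CMD window, because windows that were already open do not pick up changes to PATH.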

4. Install OpenSSL by following these steps:
o    Go to the Win32 OpenSSL page.
o    Download the Visual C++ 2008 redistributables for your Windows version and architecture.
o    Download OpenSSL for your Windows version and architecture (the regular version, not the light one).
o    Add the c:\openssl-win32\bin (or similar) directory to your PATH, the same way you added python27 in the third step.
5. Open CMD, type easy_install Scrapy, and press the Enter key.

6. pywin32: http://www.lfd.uci.edu/~gohlke/pythonlibs/#pywin32
7. Twisted: http://twistedmatrix.com/trac/wiki/Downloads
8. Zope interface: https://pypi.python.org/simple/zope.interface/
Download the 32-bit MSI or EXE file from the list, whichever is easier for you.
9. lxml 32-bit binary: https://pypi.python.org/pypi/lxml/3.3.3
10. pyOpenSSL: https://launchpad.net/pyopenssl
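Once everything is installed, a quick way to confirm that nothing was missed is to try importing each package. This is a minimal sketch rather than an official check. The helper name check_imports is hypothetical, and the import names in the list (for example OpenSSL for pyOpenSSL and win32api for pywin32) are the names these packages are normally imported under.

```python
from __future__ import print_function
import importlib


def check_imports(module_names):
    """Map each module name to True if it imports cleanly, else False."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


if __name__ == "__main__":
    # Import names assumed for the packages installed in the steps above.
    wanted = ["scrapy", "twisted", "zope.interface", "lxml", "OpenSSL", "win32api"]
    for name, ok in sorted(check_imports(wanted).items()):
        print("%-15s %s" % (name, "OK" if ok else "MISSING"))
```

Any package reported as MISSING can be reinstalled from its link before moving on.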
In the last chapter you learned how to install Scrapy and its supporting packages. In this chapter you will start learning Scrapy itself. Scrapy follows a project file layout similar to Django's, which is an efficient way to manage program source code and files. Now let's move to an example:
1. How to start?
Most people lose time at the beginning for lack of knowledge, so save yourself the trouble and follow this procedure:
a. Run CMD, move to a directory for your code, and type: scrapy startproject project_name
Note: Some useful CMD commands:
1. cd ..             to move to the parent directory.
2. mkdir dir_name    to create a new folder/directory.
3. cd dir_name       to change to that directory.
I am here: C:\scrapy_book>



b. C:\scrapy_book> scrapy startproject scrap_youtube
c. After executing this command you will see a folder named scrap_youtube, and inside it another folder named scrap_youtube: C:\scrapy_book\scrap_youtube\scrap_youtube
d. C:\scrapy_book\scrap_youtube\scrap_youtube

e. The C:\scrapy_book\scrap_youtube\scrap_youtube folder contains 4 Python files and one directory/folder:
1. items.py
2. pipelines.py
3. settings.py
4. __init__.py
5. Directory: spiders
If you are an experienced programmer with some exposure to Django, these names will look familiar. We will discuss these files after a sample program.
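To confirm that the project was generated as expected, you can check the inner folder for the four files and the spiders folder that scrapy startproject creates. The following is a minimal sketch. The helper name missing_project_entries is hypothetical, and the path assumes the scrap_youtube project from this example with the script run from C:\scrapy_book.

```python
from __future__ import print_function
import os

# Files and folders expected inside the inner package folder.
EXPECTED_FILES = ["items.py", "pipelines.py", "settings.py", "__init__.py"]
EXPECTED_DIRS = ["spiders"]


def missing_project_entries(package_dir):
    """Return the expected files/folders that are absent from package_dir."""
    missing = [f for f in EXPECTED_FILES
               if not os.path.isfile(os.path.join(package_dir, f))]
    missing += [d for d in EXPECTED_DIRS
                if not os.path.isdir(os.path.join(package_dir, d))]
    return missing


if __name__ == "__main__":
    # Assumed location: run from the folder that holds the project.
    package_dir = os.path.join("scrap_youtube", "scrap_youtube")
    missing = missing_project_entries(package_dir)
    print("Missing:", missing if missing else "nothing, the layout looks complete")
```

An empty result means the layout matches the list above. Extra files in the folder are ignored.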

How and Where to write the program?
You can write your program in Notepad and save it with a .py extension, or use IDLE, the editor that ships with Python 2.7.
Go to the Scrapy project folder and edit the file items.py like this:
 And save it (Ctrl+S).
Now move to the spiders folder and create a file inside it named spider1.py

