GitHub - amohan14/Webscrapping

Implementation of a web scraper Python script using an Ansible playbook.



  • Install Ansible on the local machine or virtual machine

      # Ubuntu
      $ sudo apt update
      $ sudo apt install software-properties-common
      $ sudo apt-add-repository --yes --update ppa:ansible/ansible
      $ sudo apt install ansible

      # CentOS (install through pip)
      $ sudo yum install python3-pip
      $ sudo pip install --upgrade pip
      $ sudo pip3 install ansible
  • Ansible connects to AWS through the boto SDK, so install the boto and boto3 packages on the local machine or VM

      $ sudo pip3 install boto boto3
  • Install AWS CLI on the VM

      $ pip3 install awscli --upgrade --user
  • Go to the AWS console and create an IAM user. Add the user to a group that has the AmazonEC2FullAccess permission.

  • Copy the user's AWS Access Key ID and AWS Secret Access Key

  • Return to the virtual machine. Configure the AWS CLI with the user's keys and set the default region as required

      $ aws configure
  • Also create a .aws/credentials file in the home directory and copy the keys and default region into it so that Ansible can read them
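
The credentials file uses a plain INI layout, so it can also be generated from a script. The following is a minimal sketch, not part of the original repository: the function name write_aws_credentials and its home parameter are hypothetical, and the key values passed to it are placeholders. It writes the region to .aws/config, which is where aws configure stores it.

```python
import configparser
from pathlib import Path


def write_aws_credentials(access_key, secret_key, region, home=None):
    """Write .aws/credentials and .aws/config under the given home directory.

    Hypothetical helper for illustration; it overwrites existing files.
    """
    aws_dir = Path(home or Path.home()) / ".aws"
    aws_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None keeps any '%' in a value from being interpreted.
    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    cred_path = aws_dir / "credentials"
    with open(cred_path, "w") as handle:
        credentials.write(handle)
    cred_path.chmod(0o600)  # keep the secret readable by the owner only

    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region}
    with open(aws_dir / "config", "w") as handle:
        config.write(handle)

    return cred_path
```

Calling write_aws_credentials("<access key id>", "<secret access key>", "us-east-1") would produce the same two files that aws configure creates interactively.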

  • Create a key pair and make sure the keypair.pem file is accessible from the virtual machine.

  • Create an EC2 instance (CentOS machine) using an Ansible playbook and add the new host to the 'ec2_hosts' group

      - name: Create an EC2 instance
        hosts: local
        connection: local
        gather_facts: False
        tasks:
          # Launch the instance with the ec2 module and keep its result
          - name: Launch instance
            ec2:
              key_name: ansible-lab
              group: ansible-node
              instance_type: t2.micro
              image: ami-02354e95b39ca8dec
              wait: true
              region: us-east-1
              # aws_access_key: "{{ lookup('env', 'AWS_ACCESS_KEY') }}"
              # aws_secret_key: "{{ lookup('env', 'AWS_SECRET_KEY') }}"
            register: ec2
          - name: Print all ec2 variables
            debug: var=ec2
          - name: Get the IP address
            debug: var=ec2.instances[0].public_dns_name
          # Register the new instance in the in-memory inventory
          - name: Add host to group 'ec2_hosts'
            add_host:
              name: "{{ ec2.instances[0].public_dns_name }}"
              groups: ec2_hosts
              ansible_host: "{{ ec2.instances[0].public_dns_name }}"
              ansible_ssh_user: ec2-user
              ansible_ssh_private_key_file: /vagrant_data/ansible-lab.pem

    Replace the following values in the playbook:

      hosts: localhost
      key_name: name of the key-pair you will use to ssh to ec2 instance.
      group: security group of ec2 (ssh port should be open for the security group used)
      instance_type, image : as per the ec2 instance you want to create.
      region: as per the requirement.

    Make sure that the full path to the .pem key file is provided to the "ansible_ssh_private_key_file" parameter
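
To make the data flow of the last playbook task concrete, here is a minimal Python sketch of how the registered result feeds the inventory entry. The helper name host_vars_from_ec2_result and the sample result are hypothetical; only public_dns_name, the group name, the SSH user, and the key path come from the playbook above.

```python
def host_vars_from_ec2_result(ec2_result, ssh_user, key_file, group="ec2_hosts"):
    """Build the inventory variables for the first launched instance.

    Mirrors the expression ec2.instances[0].public_dns_name from the playbook.
    """
    instances = ec2_result.get("instances", [])
    if not instances:
        raise ValueError("no instances in the registered result")
    dns_name = instances[0]["public_dns_name"]
    return {
        "name": dns_name,
        "groups": group,
        "ansible_host": dns_name,
        "ansible_ssh_user": ssh_user,
        "ansible_ssh_private_key_file": key_file,
    }


# Hypothetical, trimmed-down result; a real one carries many more fields.
sample_result = {
    "instances": [
        {"public_dns_name": "ec2-203-0-113-10.compute-1.amazonaws.com"}
    ]
}

host = host_vars_from_ec2_result(
    sample_result, "ec2-user", "/vagrant_data/ansible-lab.pem"
)
print(host["ansible_host"])
```

The same DNS name is used both as the inventory name and as ansible_host, which is why later plays that target ec2_hosts can reach the instance without a static inventory file.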

  • Install packages on the EC2 instance. This step can be done manually after connecting to the EC2 instance over SSH, or automatically through an Ansible playbook.

    First method: run the commands below that match the OS of the EC2 instance to install the following packages:

    1. python

      # Ubuntu
      $ sudo apt-get update && sudo apt-get upgrade -y
      $ sudo apt-get install python3.7

      # CentOS 7 (Python 3.6 from the IUS repository)
      $ sudo yum install -y https://repo.ius.io/ius-release-el7.rpm
      $ sudo yum update -y
      $ sudo yum install -y python36u python36u-libs python36u-devel python36u-pip

    2. pip

    CentOS 7 and higher

    Install pip on CentOS using the yum package manager:

      $ sudo yum install python3-pip
      $ sudo pip install --upgrade pip

    Install pip on Ubuntu using the apt-get package manager:

      $ sudo apt-get update -y
      $ sudo apt-get install python3-pip
      $ sudo pip install --upgrade pip

    3. git

      # CentOS
      $ sudo yum install git

      # Ubuntu
      $ sudo apt-get install git

    4. mariadb server

      # CentOS
      $ sudo yum install mariadb-server

    Start the MariaDB server, check its status, and run the secure-installation script:

      $ sudo systemctl start mariadb
      $ sudo systemctl status mariadb
      $ sudo echo -e "\n\nroot\nroot\n\n\nn\n\n " | mysql_secure_installation 2>/dev/null

    Note: -e makes echo interpret the backslash escapes that follow, so each \n is sent as a newline (one Enter keypress for the script's prompts). 2>/dev/null discards error messages so that they do not appear on the console: 2 is the file descriptor for standard error, where errors are written, and /dev/null is the standard Linux device for output you want ignored.

      # Ubuntu
      $ sudo apt update
      $ sudo apt install mariadb-server
      $ sudo echo -e "\n\nroot\nroot\n\n\nn\n\n " | mysql_secure_installation 2>/dev/null
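
The string piped into mysql_secure_installation is easier to read once it is split into one answer per prompt. The sketch below does that in Python. The prompt list is an assumption based on the older mysql_secure_installation script (as shipped with MariaDB 5.5 on CentOS 7); newer MariaDB releases ask additional questions, so the pairing may not hold there.

```python
# The exact string that echo -e sends to the script.
ANSWERS = "\n\nroot\nroot\n\n\nn\n\n "

# ASSUMPTION: prompt order of the older mysql_secure_installation script.
ASSUMED_PROMPTS = [
    "Enter current password for root",
    "Set root password?",
    "New password",
    "Re-enter new password",
    "Remove anonymous users?",
    "Disallow root login remotely?",
    "Remove test database?",
    "Reload privilege tables?",
]


def pair_answers(raw, prompts):
    """Pair each prompt with the line of input it would receive."""
    lines = raw.split("\n")
    return list(zip(prompts, lines))


pairs = pair_answers(ANSWERS, ASSUMED_PROMPTS)
for prompt, answer in pairs:
    # An empty answer means Enter was pressed, which accepts the default.
    print(f"{prompt}: {answer if answer else '<Enter>'}")
```

Under that assumption, the input sets the root password to root, accepts the defaults for the remaining questions, and answers n only to removing the test database.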

    5. BeautifulSoup4

      # Any OS, through pip
      $ sudo pip3 install bs4

      # Ubuntu, through apt-get
      $ sudo apt-get update -y
      $ sudo apt-get install -y python3-bs4
      $ sudo apt-get install -y python-beautifulsoup

    6. Requests

      $ sudo pip3 install requests
      $ sudo apt-get update -y
      $ sudo apt-get install -y python3-requests

    Second method: use Ansible tasks to install all of the above packages:

      - name: Install packages into ec2 hosts
        hosts: ec2_hosts
        become: yes
        tasks:
          - yum: pkg=python3 state=latest
          - yum: pkg=python3-pip state=latest
          - yum: pkg=git state=installed
          - yum: pkg=mariadb-server state=installed
          - shell: sudo systemctl start mariadb
          - shell: echo -e "\n\nroot\nroot\n\n\nn\n\n " | mysql_secure_installation 2>/dev/null
          - shell: sudo pip3 install requests bs4
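
Whichever method is used, it can help to confirm that everything landed before running the scraper. The following is a small sketch of such a check, meant to be run with python3 on the EC2 instance. The function name missing_requirements is hypothetical, and the command names in the example call are assumptions about what the packages above put on the PATH.

```python
import importlib.util
import shutil


def missing_requirements(commands, modules):
    """Return the commands not found on PATH and the modules that cannot be imported."""
    missing_commands = [name for name in commands if shutil.which(name) is None]
    missing_modules = [
        name for name in modules if importlib.util.find_spec(name) is None
    ]
    return missing_commands, missing_modules


# ASSUMPTION: these are the executables provided by the packages listed above.
commands, modules = missing_requirements(
    ["python3", "pip3", "git", "mysql"],
    ["requests", "bs4"],
)
print("missing commands:", commands)
print("missing modules:", modules)
```

Two empty lists mean the instance has the tools the scraper script needs.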
  • After the packages are installed successfully, clone the git repository that contains the web scraper Python script into the home directory of the EC2 instance.

      $ git clone https://github.com/amohan14/Webscrapping.git
  • Run the web scraper Python file. It should output a .csv file containing all the reviews and their corresponding details.

      $ python3 Webscrapping/yelp_reviews_scrapping.py

To automate the two steps above, add the following tasks to the Ansible playbook:

    - shell: git clone https://github.com/amohan14/Webscrapping.git        
    - shell: python3 Webscrapping/yelp_reviews_scrapping.py
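
After the script has run, a quick look at the generated file confirms that reviews were actually collected. This is a minimal sketch: the function name summarize_csv is hypothetical, the name of the output file is not stated here and has to be supplied by the caller, and the first row is assumed to be a header.

```python
import csv


def summarize_csv(path):
    """Return the header row and the number of data rows in a CSV file.

    ASSUMPTION: the first row of the file is a header.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count


# Example call, with the path replaced by the file the scraper produced:
# header, rows = summarize_csv("<path to the generated .csv>")
# print(header, rows)
```

A row count of zero would suggest that the scraper ran but found no reviews to write.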

You can find the complete Ansible playbook in this repository.


