Question

How to configure selenium webdriver to scrape data on server

I have a web scraper written in Python with the Selenium library that works on my local machine. When I push it to my droplet, I cannot run the app. This is one of my web scraping methods:

def defense_dash_lt10(pbp_stats, season):
    # Less than 10 foot
    url = 'https://www.nba.com/stats/players/defense-dash-lt10?Season=' + season

    options = Options()
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(service=ChromeService(
        ChromeDriverManager().install()), options=options)

    driver.get(url)

    selects = driver.find_elements(By.CLASS_NAME, "DropDown_select__4pIg9 ")
    for select in selects:
        options = Select(select).options

    for option in options:
        if option.text == 'All':
            option.click() # select() in earlier versions of webdriver
            break

    # Find the table element
    table = driver.find_element(By.CLASS_NAME, 'Crom_table__p1iZz')

    # Find all rows in the table
    rows = table.find_elements(By.TAG_NAME, 'tr')
    defense_dash_lt10 = []

    # Loop through each row and extract the data from each cell
    for row in rows:
        player_dd_lt10 = []
        # Find all cells in the row
        cells = row.find_elements(By.TAG_NAME, 'td')
        for cell in cells:
            player_dd_lt10.append(cell.text)
        # Add pbp stats to defense dash if not empty
        if player_dd_lt10:
            if player_dd_lt10[0] in pbp_stats:
                defense_dash_lt10.append(player_dd_lt10 + pbp_stats[player_dd_lt10[0]][-2:])
            else:
                defense_dash_lt10.append(player_dd_lt10 + ['NaN', 'NaN'])
    header = ['Player', 'Team', 'Age', 'Position', 'GP', 'Games', 'FREQ%', 'DFGM', 'DFGA', 'DFG%', 'FG%', 'DIFF%', "MP", "BLKR"]
    return header, defense_dash_lt10

KFSys
Site Moderator
February 12, 2024

Heya,

To run a Selenium-based web scraper on a server, such as a DigitalOcean droplet, you need to configure it to work in a headless environment. Servers typically don’t have a GUI, so you can’t run browsers in the regular, graphical mode. Here’s how to modify your existing Selenium setup to work on a server:

  1. Install Necessary Packages on the Server:
  • Ensure Python is installed on the server.

  • Install the browser and its driver. For Chrome, you’ll need the Chrome browser itself and ChromeDriver. Chrome is not in Ubuntu’s default repositories, so download the .deb package from Google and install it with apt. For example, on Ubuntu:

sudo apt-get update
sudo apt-get install -y unzip xvfb libxi6 libgconf-2-4
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt-get install -y ./google-chrome-stable_current_amd64.deb
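  • Download the ChromeDriver release that matches your installed Chrome version (check it with google-chrome --version). The 2.41 in the URL below is only an example and must be replaced with the matching release; this storage bucket hosts releases up to version 114, and drivers for Chrome 115 and later are published through Chrome for Testing instead. With Selenium 4.6 or later you can also skip this step, because the bundled Selenium Manager downloads a matching driver automatically:
wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip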
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver
  • Install Selenium for Python if not already installed:
pip install selenium
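
A common reason a scraper works locally but fails on the server is a Chrome and ChromeDriver version mismatch. The following is a minimal sketch for checking that, assuming both binaries are on the PATH under the names `google-chrome` and `chromedriver` (adjust these if your install differs). The helper names are illustrative and not part of Selenium.

```python
import re
import shutil
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 121.0.6167.85'."""
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def installed_major(binary):
    """Return the major version reported by `binary --version`, or None if not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return major_version(result.stdout)


if __name__ == "__main__":
    chrome = installed_major("google-chrome")   # assumed binary name
    driver = installed_major("chromedriver")    # assumed binary name
    print(f"Chrome major version: {chrome}")
    print(f"ChromeDriver major version: {driver}")
    if chrome is None or driver is None:
        print("One of the binaries was not found on PATH.")
    elif chrome != driver:
        print("Version mismatch: install a ChromeDriver that matches Chrome.")
```
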
  2. Modify Your Selenium Script to Use Headless Mode: Update your script to include headless options for the browser. Here’s an example modification for Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=options)

This will initialize Chrome in headless mode, allowing it to run without a GUI.
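
If the same script has to run both on your desktop and on the droplet, one option is to switch headless mode on from the environment. This is a rough sketch: the `SCRAPER_HEADLESS` variable and the `chrome_arguments` and `build_driver` helpers are names made up for this example, and the flags are the ones listed above.

```python
import os


def chrome_arguments(headless):
    """Return the Chrome flags to use; the headless ones are only added on request."""
    args = ["--no-sandbox", "--disable-dev-shm-usage"]
    if headless:
        args = ["--headless", "--disable-gpu"] + args
    return args


def build_driver():
    """Create a Chrome driver, headless when SCRAPER_HEADLESS=1 (hypothetical variable)."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    headless = os.environ.get("SCRAPER_HEADLESS", "0") == "1"
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

On the server you would then start the script with `SCRAPER_HEADLESS=1 python3 your_script_name.py`, and locally without the variable so the browser window stays visible.
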

  3. Run Your Script on the Server:
  • Transfer your script to the server.
  • Run your script as you normally would on your local machine.
  4. Additional Considerations:
  • Memory Usage: Selenium can be resource-intensive. Ensure your server has enough resources (RAM, CPU) to handle the load.
  • Execution Time: Web scraping can be slow, especially on JavaScript-heavy pages that take time to render. Consider this in your server’s resource planning.
  • Error Handling: Add robust error handling to your script to manage network issues, unexpected page structures, and other runtime errors.
  • Legal and Ethical Considerations: Ensure that your scraping activities comply with the terms of service of the website and relevant laws (like GDPR, if applicable).
  5. Debugging: If your script doesn’t work as expected, add more verbose logging to understand where it fails. Sometimes issues can arise due to differences between the local and server environments (such as different versions of packages or browsers).
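
The error handling and logging points above can be combined in a small retry helper. This is only a sketch: `retry` is a hypothetical helper, not a Selenium function, and the attempt count and delay are arbitrary defaults. In a real scraper you would probably catch narrower exceptions, such as Selenium’s `WebDriverException`, rather than `Exception`.

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("scraper")


def retry(func, attempts=3, delay=2.0):
    """Call func() until it succeeds, logging each failure; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # narrow this in real code
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)
```

For example, `retry(lambda: driver.get(url))` would log a failed page load and try it again, so the server log shows which step failed and how often.
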

Remember, running a web scraper on a server is essentially the same as running it locally, with the key difference being the headless setup and ensuring all dependencies are correctly installed on the server.

Bobby Iliev
Site Moderator
February 9, 2024

Hi there!

Running a Selenium-based web scraper on a DigitalOcean Droplet involves several considerations that differ from running the script on your local machine. You will have to set up a headless browser environment, manage web driver installations, and ensure your script can run in a non-GUI server environment.

Here’s how you could do that:

1. Install Required Packages

Ensure your Droplet is up to date and has Python installed. You’ll also need to install Selenium and a web driver manager, such as webdriver-manager, which simplifies the management of binary drivers for different web browsers.

You can install these using pip. If you haven’t installed pip, you can install it using your package manager (e.g., apt for Ubuntu/Debian).

sudo apt update
sudo apt install python3-pip
pip3 install selenium webdriver-manager
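
To confirm the packages landed in the interpreter you will actually run the scraper with, a quick check along these lines may help. `missing_packages` is an illustrative helper; `selenium` and `webdriver_manager` are the import names of the two packages installed above.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["selenium", "webdriver_manager"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("selenium and webdriver_manager are importable.")
```
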

2. Install a Web Browser and WebDriver

For headless operation, you can use Chrome or Firefox. This example uses Chrome, but the process is similar for Firefox.

  • Install Google Chrome:

    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    sudo apt install ./google-chrome-stable_current_amd64.deb
    
  • Install ChromeDriver: The webdriver-manager package you installed earlier will handle the ChromeDriver installation in your Python script, so you don’t need to manually install ChromeDriver.

3. Modify Your Script for Headless Operation

To run your browser in headless mode (without a GUI), you need to modify your Selenium script to specify headless options.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # Run Chrome without a GUI
options.add_argument('--no-sandbox')  # Disable Chrome's sandbox (often needed when running as root)
options.add_argument('--disable-dev-shm-usage')  # Use /tmp instead of /dev/shm, which can be small on servers

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
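
Below is a sketch of two habits that may help when the script runs unattended: waiting explicitly for the JavaScript-rendered table instead of reading it right after `driver.get`, and always quitting the driver so Chrome processes do not pile up on the Droplet. `quitting` and `wait_for_table` are hypothetical helper names and the 20-second timeout is an arbitrary choice; `WebDriverWait` and `expected_conditions` are Selenium’s own.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit(), even if scraping raises."""
    try:
        yield driver
    finally:
        driver.quit()


def wait_for_table(driver, class_name, timeout=20):
    """Block until an element with the given class is present, then return it."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, class_name))
    )
```

With the `driver` created above, you could wrap the scraping in `with quitting(driver):`, call `driver.get(url)` inside it, and then fetch the table with `wait_for_table(driver, 'Crom_table__p1iZz')`, using the class name from the question.
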

4. Running Your Script

Now, you should be able to run your script on the server just like you would on your local machine. Ensure you’re using the correct Python command (python3 or python) based on your server’s configuration.

python3 your_script_name.py

If you encounter any issues, reviewing the error messages and logs can provide insights into what might be going wrong!

Hope that this helps.

Best,

Bobby
