
Web scraper for electricity use data

$100-500 USD

Closed
Posted almost 13 years ago


Paid on delivery
We are seeking to build a web scraping tool that downloads energy use and billing data from a customer's online utility account. The tool will consist of two major components:

1. a data retrieval component that performs the tasks of managing user credentials, logging into the user account, and downloading the relevant data and metadata; and
2. a data engine that normalizes the retrieved data and inserts it into a database with a standard structure.

We will retrieve use and billing data for two types of energy: electricity and natural gas. The tool should be constructed in such a manner that it can be mapped to additional utilities in the future, presumably by modifying the data retrieval component to handle different web sites and data formats. However, for the first stage of this project we will be scraping data from a single web site.

The interface requirements for the tool are minimal. For the first version, we expect to run it from the command line.

Because we anticipate that this project will lead to follow-on work and could expand into a sizeable piece of software, we are for the first phase looking to establish a good working relationship with a motivated and qualified contractor. The first phase will therefore involve only a very simple script that logs into a target web site and downloads the most recent twelve months of electricity data in flat files.

## Deliverables

**NOTE**: This document describes the complete requirements for the system, but presently we are only requesting bids for the first phase of this project, which represents a small subset of the total functionality. We are seeking to establish a relationship with a vendor we trust, and we will then begin work on follow-on phases.

These requirements are also described in the attached Word document.

## Overview

Over the course of several phases we envision for this project, we are seeking to build a web scraping tool that retrieves energy use and billing data from a utility's web site.
A typical use scenario is as follows:

1. User provides login credentials for utility web site. The user's login will typically be associated with various accounts for which data is tracked.
2. System logs into the utility web site and retrieves metadata related to the customer account. Based on this metadata, the system sets up appropriate database structures to store the related energy use and billing data.
3. System reports back available date ranges and other parameters such as available accounts.
4. User specifies a date range, accounts and other parameters for downloading energy use and billing data.
5. System logs into the utility web site and downloads a file or set of files containing the specified data.
6. System converts the downloaded data into a standardized format and inserts it into a relational database.
7. System reports the status of the retrieval and any error conditions.
8. In the future, the user can instruct the system to retrieve any new data published since the last retrieval was performed.

In addition, the system must handle certain common error scenarios, such as failure to log in, detection of account changes, etc.

Although this document will describe the complete system, the deliverable for the first phase is a small subset of the total functionality.

## Scope for Phase 1

For Phase 1, the desired system will:

1. Log into a single utility web site
2. Download a set of flat files corresponding to the most recent 12 months of electricity use data
3. Log the results

No database integration is required for this phase. The interface for the tool will be via the command line. The successful bidder will be expected to install the software on our server and provide any necessary configuration to our server environment.
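As a rough illustration only, the three Phase 1 steps could be laid out as a small command-line script. The sketch below is in Python, although the posting sets no language requirement. The function names (`last_twelve_months`, `download_month`), the flags, and the choice to count the current month as one of the twelve are all assumptions made for the example. The site-specific login and download step is left as a placeholder because the target web site is not described here.

```python
import argparse
import logging
from datetime import date


def last_twelve_months(today: date) -> list[tuple[int, int]]:
    """Return (year, month) pairs for the 12 most recent months, oldest first.

    Assumption: the current month counts as the most recent one.
    """
    months = []
    year, month = today.year, today.month
    for _ in range(12):
        months.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    months.reverse()
    return months


def download_month(year: int, month: int, out_dir: str) -> str:
    """Hypothetical placeholder for the site-specific login and download step."""
    raise NotImplementedError("site-specific scraping is not shown here")


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(
        description="Phase 1 sketch: download 12 months of electricity data")
    parser.add_argument("--out-dir", default="downloads",
                        help="directory for the downloaded flat files")
    parser.add_argument("--log-file", default="scraper.log",
                        help="file in which results are logged")
    args = parser.parse_args(argv)

    logging.basicConfig(filename=args.log_file, level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    failures = 0
    for year, month in last_twelve_months(date.today()):
        try:
            path = download_month(year, month, args.out_dir)
            logging.info("downloaded %04d-%02d to %s", year, month, path)
        except Exception as exc:  # log every failure and keep going
            failures += 1
            logging.error("failed %04d-%02d: %s", year, month, exc)
    return 1 if failures else 0
```

Calling `main(["--out-dir", "data"])` would attempt each month in turn and write one log line per month. Keeping the month arithmetic separate from the download step means the site-specific part can be swapped out when other utilities are added.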
## Programming environment and database

The tool will run on a Linux-based server with a MySQL database:

* Linux Version [login to view URL]
* MySQL Community Server 5.1.56

We have no firm requirement regarding programming language, as long as the chosen environment meets our needs for scalability and speed. Although each individual data retrieval will likely be relatively small (5MB or less), the master database could easily grow to 100 million rows. The programming environment and database structure should be able to handle processing tasks and common data maintenance tasks (CRUD) at these volumes.

A desired but not necessary feature is that the tool also run on a Mac installed with OS 10.6 and MySQL. Most software written for a Unix-based environment should meet this requirement.

## Data description

We will provide a more formal data definition at the start of the project, but the following definitions should aid in understanding the scope of the project.

> CUSTOMER. A person or organization that uses energy and receives an energy bill.
>
> UTILITY. An entity that sells energy, keeps records of CUSTOMER energy use, and issues bills for energy services.
>
> ACCOUNT. A billing relationship between a CUSTOMER and a UTILITY. A single CUSTOMER can have more than one ACCOUNT with a single UTILITY.
>
> COMMODITY. A type of energy delivered. For this project, we are concerned with two types of COMMODITIES: electricity and natural gas.
>
> FACILITY. A physical site where energy is consumed. Typically a facility is a building such as an office or warehouse, but it might also be other types of structures such as parking lots or radio antennas.
>
> METER. A physical device used to track the consumption of a COMMODITY at a FACILITY. METERS record time-series data at varying levels of granularity. Some METER data is reported on a monthly basis; some is reported every 15 minutes. A single FACILITY may have one or more METERS.
>
> SERVICE AGREEMENT.
> Each SERVICE AGREEMENT represents a billing arrangement for a single COMMODITY tracked by a single METER. An ACCOUNT can contain many SERVICE AGREEMENTS. Energy use data and billing data are tracked at the level of the SERVICE AGREEMENT. SERVICE AGREEMENTS also have an associated RATE CODE that determines how CUSTOMERS are billed for energy use.
>
> RATE CODE. A simple tag that indicates what rate structure is applied to ENERGY DATA to generate bills.
>
> ENERGY DATA. The simplest form of ENERGY DATA consists of a time stamp, a unit of measurement, and an amount. However, ENERGY DATA can be considerably more complex than this, including elements such as beginning and ending time stamps (for irregular periods of measurement), rate of use (as opposed to amount used), missing data text fields, etc.

## Features

The tool will run in several distinct modes:

* Metadata retrieval
* Metadata update
* Data retrieval

Although these modes overlap somewhat, they are described separately below for the sake of clarity.

### Metadata retrieval

Before retrieving ENERGY DATA, the tool must first retrieve all of the metadata associated with an ACCOUNT and set up corresponding database structures to receive the data. Although a complete entity-relation diagram will be provided at the start of the project, the basic metadata structure is described here. Items in {brackets} indicate a one-to-many relationship.

> CUSTOMER (customer ID, customer name, {account})
>
> ACCOUNT (account ID, utility, login, password, {service agreement})
>
> SERVICE AGREEMENT (service agreement ID, meter ID, status, commodity type, energy data time interval, energy data unit, facility, service location address, service location description, time zone of data, rate code, {energy data})
>
> FACILITY (facility ID, facility name, address, primary use, square footage, description)

Note that FACILITY metadata information won't be available from the scraped web site.
Rather, a human operator will have to enter this information separately.

### Metadata update

The system should be able to retrieve the metadata for a given ACCOUNT and perform a comparison to the existing database, reporting any differences. For this version of the application, it is not necessary that the tool modify the metadata in the database to match the most up-to-date information, because such changes require a set of rules regarding how to propagate updates through the system. But it is necessary that the tool be able to detect and report on such changes.

### Data retrieval

The tool must be able to retrieve energy use and billing data associated with SERVICE AGREEMENTS. Retrieval takes the following parameters:

* Specific ACCOUNTS to retrieve data for (default is "all")
* Specific ACCOUNTS to exclude (default is "none")
* Specific SERVICE AGREEMENTS to retrieve data for (default is "all")
* Specific SERVICE AGREEMENTS to exclude (default is "none")
* Specific COMMODITIES to retrieve (default is "all")
* Start date for retrieval. Tool should allow user to retrieve all new data since the last retrieval.
* End date for retrieval (default is current date)
* Merge type - if an individual piece of data is already present in the database, this flag indicates whether it should be updated according to the most recent retrieval, or whether the new value should be ignored (default is "ignored")

### Reporting

After every metadata retrieval, metadata update, and data retrieval, the system should report summary statistics:

* Complete list of ACCOUNT and SERVICE AGREEMENTS retrieved
* Complete list of metadata changes, including additions, deletions, and modifications
* For each SERVICE AGREEMENT, a tally of the number of energy use records retrieved.
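To make the merge type flag and the per-SERVICE AGREEMENT tally concrete, here is a minimal Python sketch. It assumes records are keyed by service agreement ID and time stamp and uses an in-memory dict in place of the MySQL tables. The function name `merge_records` and the flag values `"ignore"` and `"update"` are hypothetical, chosen to mirror the two behaviours described for the merge type parameter.

```python
from collections import Counter


def merge_records(existing, incoming, merge_type="ignore"):
    """Merge retrieved records into a store and tally them per service agreement.

    existing   -- dict mapping (service_agreement_id, timestamp) to an amount;
                  a stand-in for the database in this sketch
    incoming   -- iterable of (service_agreement_id, timestamp, amount) tuples
    merge_type -- "ignore" keeps a value already present (the default),
                  "update" overwrites it with the newly retrieved value
    Returns a Counter of records retrieved per service agreement.
    """
    if merge_type not in ("ignore", "update"):
        raise ValueError(f"unknown merge type: {merge_type!r}")

    tally = Counter()
    for sa_id, timestamp, amount in incoming:
        tally[sa_id] += 1  # count every retrieved record, merged or not
        key = (sa_id, timestamp)
        if key in existing and merge_type == "ignore":
            continue  # leave the stored value untouched
        existing[key] = amount
    return tally
```

For example, retrieving two records for one service agreement and one for another yields a tally of 2 and 1, regardless of how many of those records were already stored.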

The system should also report any errors encountered:

* Invalid login credentials
* Failure to retrieve data
* Invalid retrieval request (e.g., for a non-existent SERVICE AGREEMENT)

Reports and error alerts should be recorded in a log on the server, and also made available via an RSS feed.

## Technical issues

Because the data is being retrieved via web-scraping rather than via a well-specified API, some trial-and-error will be necessary. The following issues may arise:

* Different pieces of metadata, energy use data, and billing data may live on different areas of the scraped web site, perhaps even requiring different login credentials.
* The scraped web site may use session cookies, JavaScript, frames, Flash, and other common web elements, complicating the data retrieval task.
* It's possible (although unlikely for this initial project) that the scraped web site will deliver some information via emails that will have to be captured by the scraping system.
* The scraped web site may have reliability issues, necessitating that the system automatically resubmit failed data requests.
* The scraping tool may have to throttle requests to avoid overloading the scraped site.

Other issues will undoubtedly arise in the course of building the scraper.

## Future enhancements

Following successful completion of the scraping tool, we will likely seek several enhancements beyond the functionality described in this document.
These are listed here because they may be useful in informing architectural decisions:

* Capture of structured billing data, including PDF images of bills
* Extension of the tool to other utilities
* Browsing of metadata via a web-based interface
* Entry of metadata (primarily related to CUSTOMERS, ACCOUNTS, and FACILITIES) via web-based interface
* Specification of data retrieval tasks (primarily SERVICE AGREEMENTS and dates) via web-based interface
* Automatic scheduling of data retrievals so that they can be queued and executed sequentially
* Handling metadata changes in a way that provides a coherent view of energy use over time even when billing relationships change
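Two of the items listed under Technical issues, resubmitting failed requests and throttling, lend themselves to a short sketch. The Python below is one possible approach, not a prescribed design. The names `fetch_with_retry` and `Throttle`, the exponential back-off schedule, and the choice to retry only on `IOError` are assumptions made for the example.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts, and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError as exc:  # assumed to cover transient network failures
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


class Throttle:
    """Enforce a minimum interval, in seconds, between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
```

A scraper would call `throttle.wait()` before each request and wrap the request itself in `fetch_with_retry`. The `clock` and `sleep` parameters exist only so the timing logic can be exercised without real delays.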
Project ID: 3327273

About the project

17 bids
Remote project
Active 13 years ago

17 freelancers are bidding an average of $278 USD for this job. The public text of every bid reads "See private message."

* $127.50 USD in 2 days, 5.0 (168 reviews), 6.8
* $102 USD in 2 days, 4.9 (353 reviews), 6.5
* $212.50 USD in 2 days, 5.0 (259 reviews), 6.4
* $255 USD in 2 days, 4.9 (76 reviews), 6.1
* $1,445 USD in 2 days, 4.8 (61 reviews), 6.1
* $170 USD in 2 days, 5.0 (71 reviews), 5.1
* $208.25 USD in 2 days, 4.9 (68 reviews), 5.0
* $85 USD in 2 days, 5.0 (10 reviews), 4.6
* $102 USD in 2 days, 5.0 (37 reviews), 4.6
* $170 USD in 2 days, 5.0 (12 reviews), 4.3
* $127.50 USD in 2 days, 5.0 (7 reviews), 3.5
* $424.15 USD in 2 days, 5.0 (1 review), 3.5
* $340 USD in 2 days, 5.0 (3 reviews), 3.3
* $191.25 USD in 2 days, 0.0 (0 reviews), 0.0
* $382.50 USD in 2 days, 0.0 (0 reviews), 0.0
* $297.50 USD in 2 days, 0.0 (1 review), 3.3
* $85 USD in 2 days, 0.0 (1 review), 0.0

About the client

From the United States
5.0
7
Member since May 19, 2011

Client verification

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)