This study presents the findings from testing web scraping techniques in consumer price surveys, with an emphasis on consumer electronics (goods) and airfares (services). The starting point for the article is the work of the Italian National Statistical Institute (Istat) within the European project "Multipurpose Price Statistics" (MPS).
The modernisation of data collection and the use of web scraping techniques are two of the themes addressed by MPS.
The problem of quality (in terms of efficiency and error reduction) is discussed, along with some early observations on the usefulness of big data for statistical purposes. The introduction outlines the paper's overall goals, followed by a description of the selection of items used to evaluate the web scraping procedures. After a description of the surveys for consumer electronics and airfares, the findings and issues from testing web scraping techniques are presented. Section 5 focuses on some observations concerning probable quality improvements resulting from web scraping for inflation measurement, and some concluding remarks are made with a special focus on the topic of big data. The centralised collection of consumer prices in Italy, as well as the IT tools used for web scraping, are presented in two fact boxes.

One of the foundations of the European Commission's "Multipurpose Price Statistics" (MPS) programme is the modernisation of data collection techniques to improve the quality of the Harmonised Index of Consumer Prices (HICP). The three primary elements of this modernisation are the use of electronic devices to gather price information in the field, the use of scanner data as a source for inflation estimates, and the expanded deployment of web scraping techniques to collect data from the web for HICP compilation.
In its modernisation of data collection, Istat is actively working on all three of these aspects. A group of statisticians and IT specialists specifically examined the use of web scraping techniques in consumer price surveys, concentrating on two product groups: "consumer electronics" (goods) and "airfares" (services). Web scraping procedures have been designed and tested for each of these groups of items. Using web scraping techniques merely to improve the efficiency of the present survey yields only a few benefits in terms of time savings, without enlarging the data collection. However, developing and maintaining dedicated macros in order to extend web scraping to the entire data collection of airfares on the web could improve quality (a more automated and controlled production process) and free Istat data collectors for other tasks, even if only for a few hours.
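To illustrate the kind of procedure involved, the sketch below shows a minimal price-scraping routine in Python. The actual Istat macros are not reproduced here; the URL, the CSS selector, and the function name are hypothetical placeholders, and the sketch assumes the requests and BeautifulSoup libraries.

```python
# Minimal sketch of a price-scraping routine. The URL and CSS selector
# are hypothetical placeholders, not the actual Istat macros.
import requests
from bs4 import BeautifulSoup

def scrape_price(url: str, selector: str) -> str:
    """Download a product page and extract the displayed price."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    price_node = soup.select_one(selector)
    if price_node is None:
        raise ValueError(f"No price found at {url}")
    return price_node.get_text(strip=True)

# Example call with placeholder values:
# price = scrape_price("https://example-retailer.it/tv-model-x", "span.price")
```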
The experiments were undertaken in the context of the current Italian consumer price survey, in which a portion of the data collection (more than 21% of the basket of items in terms of weights) is already carried out centrally by Istat and on the internet.
As a result, the goal of these new efforts was twofold: on the one hand, understanding the impact on quality, in terms of reducing the error of the measures obtained as well as in terms of the efficiency and costs of the consumer price survey; on the other hand, exploring and analysing the issues that arose from the use of web scraping, an exercise that opened the way to some thoughts and remarks about using web scraping to access "big data" on the web for the measurement of inflation.
Although the work done so far was carried out within a short period of time, as will be seen in the following, it allowed us to achieve considerable, if not decisive, results in terms of new efficiency. Much more remains to be done: for example, we have only recently begun examining potential improvements in survey quality, with some early remarks on the use of big data and its implications for survey design.

In response to a given query, the Meridiana website displayed additional pages offering optional services or requesting passenger information before presenting the final price, forcing us to use a distinct and more sophisticated macro to scrape fares.
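A sketch of how such a multi-step query might be automated is given below, using Selenium as an assumed tool; the original macros are not public, so the URL and all element locators (field names, button ids) are illustrative assumptions rather than Meridiana's actual page structure.

```python
# Sketch of automating a multi-step airfare query. All locators and the
# URL are illustrative assumptions, not the actual Meridiana site layout.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_fare(origin: str, destination: str, date: str) -> str:
    driver = webdriver.Firefox()
    try:
        driver.get("https://www.example-airline.it")  # placeholder URL
        driver.find_element(By.NAME, "origin").send_keys(origin)
        driver.find_element(By.NAME, "destination").send_keys(destination)
        driver.find_element(By.NAME, "departureDate").send_keys(date)
        driver.find_element(By.ID, "search").click()

        # Intermediate pages (optional services, passenger details) have to
        # be dismissed before the final price is shown.
        for step_id in ("skip-extras", "skip-passenger-info"):
            button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.ID, step_id)))
            button.click()

        fare = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".final-price")))
        return fare.text
    finally:
        driver.quit()
```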
Finally, attention focused on EasyJet. The macros built produced excellent results in accurately reproducing manual data collection, but the time savings were minimal. This is partly due to the time spent preparing the input files that the macros use to identify the routes and dates for which prices are to be scraped and to return a correct output usable for index compilation, and partly due to the limited number of elementary quotes involved (60): only a limited time saving could be derived from the adoption of web scraping techniques, even though they are a powerful tool for acquiring large amounts of elementary data quickly. Web scraping techniques were then applied to the airfares of conventional airline companies through the web agency Opodo (www.opodo.it), for 160 monthly price quotations. The macro's results were assessed by comparing the efficiency and coherence of the manually downloaded data with the macro's output.
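The input-file mechanism described above can be sketched as follows: a file listing the routes and dates to survey drives the scraper, and the results are written out in a form usable for index compilation. The CSV layout, the column names, and the file names are assumptions for illustration.

```python
# Sketch of an input-file-driven collection loop. The column layout
# (origin, destination, date) is an assumed format, not Istat's actual one.
import csv

def run_monthly_collection(input_path: str, output_path: str, scraper) -> None:
    with open(input_path, newline="") as infile, \
         open(output_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile)  # expects origin,destination,date columns
        writer = csv.writer(outfile)
        writer.writerow(["origin", "destination", "date", "price"])
        for row in reader:
            price = scraper(row["origin"], row["destination"], row["date"])
            writer.writerow([row["origin"], row["destination"], row["date"], price])

# e.g. run_monthly_collection("easyjet_routes.csv", "quotes.csv", scrape_fare)
```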
In this case, too, the efficiency gains are fairly minimal. In the most recent test on monthly data collection, the Opodo macro required 1 hour and 48 minutes to obtain the 160 elementary price quotations that took two and a half hours to download manually. As with the EasyJet macro, an input file must be prepared to drive the Opodo macro in finding the right sample of routes; in addition, it must manage the difference between conventional and low-cost carriers. As a result, the time needed for automated price collection is not very different from that required for manual collection.
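One possible way to handle this extra requirement is sketched below: the input file also carries a carrier-type field, and a single routine dispatches each record to the appropriate scraping logic. The field name and the two handler functions are illustrative assumptions, not the actual Opodo macro.

```python
# Sketch of dispatching on carrier type. The "carrier_type" field and the
# two handlers are illustrative assumptions; the site-specific scraping
# logic is deliberately left as stubs.
def scrape_low_cost(origin: str, destination: str, date: str) -> str:
    raise NotImplementedError("site-specific logic, e.g. the EasyJet flow")

def scrape_conventional(origin: str, destination: str, date: str) -> str:
    raise NotImplementedError("site-specific logic, e.g. the Opodo flow")

HANDLERS = {"low_cost": scrape_low_cost, "conventional": scrape_conventional}

def scrape_route(row: dict) -> str:
    try:
        handler = HANDLERS[row["carrier_type"]]
    except KeyError as exc:
        raise ValueError(f"unknown carrier type in input row: {row!r}") from exc
    return handler(row["origin"], row["destination"], row["date"])
```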