How to Handle and Analyse Large Datasets BENVGEE7

BENVGEE7 'Methods of Environmental Analysis'
Ed Sharp
21st February 2012

Introduction

Me:
BSc Geography
Worked at SABSCO Ltd, a niche power station construction contractor
MSc GIS, MRes Energy Demand Studies
PhD: The spatiotemporal patterns of energy demand and supply in the UK
Recent interest and research into large datasets, including a major piece of research into the effects of disparate, inaccurate datasets on energy demand forecast models

Email: [email protected]

Web
Linkedin: http://www.linkedin.com/pub/ed-sharp/43/2b4/b1b
UCL: http://www.bartlett.ucl.ac.uk/energy/people/students/ed-sharp
LoLo: http://www.lolo.ac.uk/profilepreview/view/id/102

Today's Lecture

Three distinct sections
1. Theory: Describe how to handle and analyse large datasets
2. Practice: Run an exercise outlining some pervasive issues
3. Case Study: Demonstrate these within the context of some existing research
Slides are available on Moodle with web and literature references in full; colour denotes section.

Part 1: What is a large dataset?

Two types

Large volumes of data
  Millions of entries
  Many terabytes
  Computationally intensive
  Past 10 years x 1m
Varied sources of data
  Same variables
  Different sources
Separate set of issues causing problems with handling and analysis

There are issues that are common between the two as well as individual ones

Examples

Volumes
Census (http://census.ac.uk/)
Home Energy Efficiency Database (HEED, http://www.energysavingtrust.org.uk/Professional-resources/Existing-Housing/Homes-Energy-Efficiency-Database)
Time series datasets, e.g. energy production/consumption
Remotely sensed data
Geographic datasets
Climate reanalyses

Sources
Population
Economic variables (GDP, GVA etc.)
Socio-demographic variables (Population, Employment etc.)

Sources including repositories and search engines:

Data.gov: www.data.gov.uk
GoGeo: www.gogeo.ac.uk
ShareGeo: www.sharegeo.ac.uk
Eurostat: http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/
IEA: www.iea.org
National Statistics: www.statistics.gov.uk
Odyssee: http://www.odyssee-indicators.org/
OECD: www.oecd.org
UNECE: www.unece.org
World Bank: www.worldbank.org
ADS, Archaeology Data Service: archaeologydataservice.ac.uk
BADC, British Atmospheric Data Centre: badc.nerc.ac.uk
BODC (Oceanographic): www.bodc.ac.uk
CDS, Chemical Database Service: cds.dl.ac.uk
EBI, European Bioinformatics Institute: www.ebi.ac.uk
ESDS, Economic and Social Data Service: www.esds.ac.uk
NCDR, National Cancer Data Repository: www.ncin.org
NGDC, National Geo-science Data Centre: www.ngdc.noaa.gov
UKSSDC, UK Solar System Data Centre: www.ukssdc.ac.uk
Office for National Statistics: www.ons.gov.uk
UK Data Archive (UKDA): www.data-archive.ac.uk
Casweb (census): casweb.mimas.ac.uk
DFT: www.dft.gov.uk
EEA: www.eea.europa.eu
World Energy Council: www.worldenergy.org
Florida Solar Energy Centre: www.fsec.ucf.edu/
EDINA: edina.ac.uk
Mapcruzin: www.mapcruzin.com
Guardian datastore: www.guardian.co.uk/data
London Air Quality Network: www.londonair.org.uk
OpenStreetMap: www.openstreetmap.org
UK Borders: edina.ac.uk/ukborders
Met Office: www.metoffice.gov.uk
DECC: www.decc.gov.uk
etc.

Highlighted examples should be the most relevant to EDE

Has anyone used large datasets before?
1. Yes
2. No
Poll results: 88%, 12%

Does anyone think they will use them in the future?
1. Yes
2. No
3. Don't know
Poll results: 44%, 38%, 19%

Likely encounters

Access is predominantly through the web
Some may require sign-in through the university
Fees are sometimes waived for academic use (always worth asking)
Verify copyright and licensing

Used in
Research
Modelling
Pervasive in the environmental domain
Property
Finance
Volume and complexity are increasing (e.g. Facebook, Flickr)
McKinsey concluded that the analysis of this kind of dataset will become increasingly important in influencing business decisions, so skills in this area will be valuable
McKinsey: Big data: The next frontier for innovation, competition, and productivity. Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation

Storage:

Very large datasets require their own servers, especially those which require security, e.g. HEED and OpenStreetMap
Parallel storage allows download simultaneously with simulation, visualisation and analysis
Hardware development means all but the very biggest can be stored and transported on portable hard drives
Most can be downloaded via the internet or, in special cases, requested on a CD (e.g. Ordnance Survey MasterMap)
Effective backup is necessary, especially once analysis begins
Bespoke data architecture exists (e.g. financial databases)
  This primarily requires knowledge of SQL
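As a rough illustration of the kind of SQL that a database-backed dataset calls for, the sketch below loads a few rows into an in-memory SQLite database and averages them by region. The table name, column names and values are hypothetical and invented for this example; a real dataset such as HEED will have its own schema and access route.

```python
import sqlite3


def mean_consumption_by_region(rows):
    """Load (region, year, kwh) rows into SQLite and average kwh per region."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE consumption (region TEXT, year INTEGER, kwh REAL)"
        )
        conn.executemany("INSERT INTO consumption VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, AVG(kwh) FROM consumption "
            "WHERE kwh IS NOT NULL "  # leave genuine gaps out of the average
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Hypothetical sample values, purely for illustration.
sample_rows = [
    ("London", 2010, 4000.0),
    ("London", 2011, 3800.0),
    ("Wales", 2010, 4500.0),
    ("Wales", 2011, None),  # a gap stored as NULL
]

result = mean_consumption_by_region(sample_rows)
print(result)
```

The same SELECT statement could be typed into a graphical database client; the Python wrapper is only there to make the sketch self-contained.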

Most data that you encounter will be accessible through some sort of graphical interface
Example on next slide

[Figure: graphical interface and SQL script]

Software and data format

Use whatever you are comfortable with
Excel is OK for the majority of operations and good graphically
  Limited to 1,048,576 rows and 16,384 columns per sheet (beware when importing data)
For larger datasets or more sophisticated operations consider a statistical package
  SAS is very good for large datasets but requires programming skill
  SPSS is almost as powerful, with a better interface
    Works well in conjunction with Field (2009)
Microsoft Access allows handling of large, complicated databases
All of these are available through cluster machines or for home use from http://www.ucl.ac.uk/isd/common/software
Alternatives include: R, Mathematica, Statistica and RapidMiner

Formats

Excel (.xls, .xlsx)
Access (.mdb, .dbf)
SAS and SPSS have proprietary formats but can be exported to Excel
A common format used for exchange is comma separated (.csv, .txt)
Others include: XML (machine readable), CDF (NASA), NeXus, OpenMath, PDS, SAIF, SDTS, VICAR etc. (these require some kind of specialist knowledge)

Field, A. P. 2009. Discovering statistics using SPSS, SAGE Publications Ltd.

Data Handling: First steps

1. Metadata

  Data about data
  Attached in different ways
  Varies in form and content
  Should follow standards, e.g. INSPIRE http://inspire.jrc.ec.europa.eu/
2. Identify methods of collection
  Are these uniform across data sources?
  May require reading supporting documentation
3. Identify contributors
  Are they reliable?
4. Identify alternative sources
  Case study will show that divergence is possible

Data Handling: Second steps

5. Identify data gaps
  First do this visually
  Genuine gaps should not skew subsequent analysis
  If a gap has been replaced by, for example, NULL or 0.0, it may cause problems and should be investigated
  If several datasets are used, the treatment of gaps should be harmonised
  Follow a convention that is obvious to you and acceptable to the software
6. Identify duplicates
  More than one value for a data point
  Possibly valid
  E.g. shortened labels can falsely group values

Data Handling: Second steps continued

7. Note precision
  Data should be stored at a reasonable precision
  For example: beware of the dataset that tries to depict population to the nearest person
  Harmonise between datasets
  Can affect comparability with other data
8. Identify spurious data
  Many rows and columns may not be needed
  Discard to make analysis simple
  Note changes
  Keep copies of the original
9. Harmonise headings
  Ensure that they make sense to you and the software

Graphical representation and statistical analysis

The above steps can be carried out by looking through the data
However, techniques exist to automate them and therefore reduce time
The first step in any analysis should be to create graphs
These can reveal patterns alongside highlighting duplicates, gaps and errors
After this is done it may be useful to clean your data again
Excel is fine, but more complex and repeatable operations are available with other software and some programming

Some examples
[Figure: a simple graph]
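A minimal sketch of how some of the handling steps (harmonising headings, finding gaps and finding duplicates) might be automated with a little programming. The sample table, the list of gap markers and the function names are all assumptions made for this illustration; the convention you adopt should be whatever is obvious to you and acceptable to your software.

```python
import csv
import io
from collections import Counter

# Strings treated as placeholders for missing data (an assumed convention).
GAP_MARKERS = {"", "null", "na", "n/a", "-"}


def harmonise_heading(heading):
    """Lower-case a heading and join its words with underscores."""
    return "_".join(heading.strip().lower().split())


def is_zero(value):
    """True if the text parses as the number zero, a possible disguised gap."""
    try:
        return float(value) == 0.0
    except ValueError:
        return False


def check_table(csv_text, key_columns):
    """Report gaps, suspicious zeros and duplicated keys in a small CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        {harmonise_heading(name): value.strip() for name, value in row.items()}
        for row in reader
    ]
    gaps = [
        (index, column)
        for index, row in enumerate(rows)
        for column, value in row.items()
        if value.lower() in GAP_MARKERS
    ]
    zeros = [
        (index, column)
        for index, row in enumerate(rows)
        for column, value in row.items()
        if column not in key_columns and is_zero(value)
    ]
    key_counts = Counter(tuple(row[k] for k in key_columns) for row in rows)
    duplicates = sorted(key for key, count in key_counts.items() if count > 1)
    return {"rows": rows, "gaps": gaps, "zeros": zeros, "duplicates": duplicates}


# Hypothetical sample data, invented purely for illustration.
SAMPLE = """Region, Year ,Floor Area
Camden,2010,1200
Camden,2010,1250
Islington,2010,NULL
Islington,2011,0.0
"""

report = check_table(SAMPLE, key_columns=["region", "year"])
print(report["gaps"])        # (row index, column) pairs holding a gap marker
print(report["zeros"])       # zeros that may be gaps in disguise
print(report["duplicates"])  # key combinations that occur more than once
```

The function only reports problems; deciding whether a duplicate is valid or a zero is genuine still needs the metadata and your own judgement. It assumes every row has the same number of fields as the heading row.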

Tufte (1983) and McCandless (2009)
[Figure: something more complex]
[Figure: some better looking examples]

Statistical tests

Another automated analysis technique is statistical testing
Simple metrics such as mean, median, mode and standard deviation are useful, as well as looking at the distribution
These can be combined in a box plot, conveying statistics graphically
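A short sketch of the simple metrics mentioned above, together with the quartiles that a box plot would draw. It uses Python's standard statistics module (version 3.8 or later for quantiles); the sample values are hypothetical, and the "inclusive" quartile method is one of several conventions, so other packages may report slightly different quartiles.

```python
import statistics


def summarise(values):
    """Return the basic descriptive statistics for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,  # lower edge of the box in a box plot
        "q3": q3,  # upper edge of the box in a box plot
        "max": max(values),
    }


# Hypothetical sample values, purely for illustration.
sample = [2, 4, 4, 5, 7, 9, 11]

summary = summarise(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median is a quick first check on the distribution: a large difference suggests skew or outliers worth investigating.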

The t test is also useful
More sophisticated analysis is possible through e.g. SPSS, GIS

Advanced analysis, simulation and visualisation

These methods vary based on purpose and available data
If you have purely statistical intentions then something like SPSS or SAS is ideal, especially in conjunction with Field (2009)
A multitude of tests exist which will suit your needs; beware that these depend on data type, collection etc.
The internet, along with books and lecturers, is a good source for deciding which to choose

GIS
A good program for visualisation, provided that you have spatially related data
Some examples of output that I have produced are on the next slide; again there is an abundance of web and literature resources

Part 2: Exercise

Attempt to calculate the floor area of Central House (this building) in pairs
Stay in the room but use whatever techniques you have at your disposal
No use of the internet (it will be obvious)
Write your answer down on a piece of paper
10 minutes
Be prepared to answer some questions using the poll system
We will declare a floor area champion at the end

What units did you use?

1. Acres
2. Hectares
3. Square mile
4. Square kilometre
5. Square metre
6. Square foot
Poll results: 100%, 0%, 0%, 0%, 0%, 0%

Why?
Although the standard is m², you should not assume that data you are given uses this standard
Always check the metadata to ensure that it has been done correctly
Remember that American data will often not use the metric system, and a large volume of data originates from the US
Other units could well be correct, but ensure that you use the data properly

Did you include the basement in your calculations?
1. Yes
2. No
Poll results: 100%, 0%

Why?
Floor area can be defined as usable floor area; in this case the basement is used, but someone creating a larger database would not have this information
This can cause divergence between real data and that which you are provided with
Check the metadata
And, if necessary, check at source

Did you attempt to subtract the floor area of interior walls?
1. Yes
2. No
Poll results: 100%, 0%

Why?
Alongside different ways of defining floor area (semantics), there are different ways of calculating it
It is possible that a dataset may have been formed from an Ordnance Survey outline, which would include them
Or from a building survey, which would not
Neither is wrong but transparency is essential

How many floors did you allow for?
1. 3
2. 4
3. 5
4. 6
5. 7
6. 8
7. 9
8. More
Poll results: 42%, 26%, 16%, 0%, 0%, 0%, 16%, 0%

Why?
The correct number is eight but this may not be clear from plans
Is the basement included in this?

Did you allow for the light well in the centre of the building?
1. Yes
2. No
Poll results: 71%, 29%

Why?

One method of calculating this would be to work out the area of the bottom floor and multiply it by the number of floors
If you were unaware of the gap this may skew the result
This type of error is common not only in floor area calculation but in others that you may come across
It is important to investigate and understand these sources of error

What was your final answer in metres squared?
1. 0 to 750
2. 750 to 1500
3. 1500 to 2250
4. 2250 to 3000
5. 3000 to 3500
6. 3500 to 4000
7. 4000 to 4500
8. 4500 to 5000
9. More
Poll results: 26%, 26%, 21%, 11%, 11%, 5%, 0%, 0%, 0%

Conclusion:
The real answer was 3,658 m² (39,376 sq ft, 0.003658 km², 0.903949 acres, 0.365815 hectares, 0.001412 square miles)
Interestingly there is no DEC here, so the figure is taken from the internet
Different ways of defining the floor area have been used here, as is the case for real datasets
The reality is that the data you have created is probably as good an estimate of the floor area as is publicly available
Errors would be multiplied if applied to, for example, the whole country, which is a large dataset
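Because units cannot be assumed, it helps to convert explicitly rather than by eye. The sketch below converts a floor area in square metres into the other units quoted in the conclusion. The conversion factors are the exact international definitions; the input of 3658.15 m² is an assumption, inferred from the 0.365815 hectare figure rather than stated directly on the slide.

```python
# Size of each unit in square metres (exact international definitions).
M2_PER_UNIT = {
    "sq_ft": 0.09290304,
    "acre": 4046.8564224,
    "hectare": 10_000.0,
    "sq_km": 1_000_000.0,
    "sq_mile": 2_589_988.110336,
}


def convert_area(area_m2):
    """Express an area given in square metres in each of the other units."""
    return {unit: area_m2 / factor for unit, factor in M2_PER_UNIT.items()}


# Assumed input: 3658.15 m2, inferred from the quoted 0.365815 hectares.
converted = convert_area(3658.15)
for unit, value in converted.items():
    print(f"{unit}: {value:.6f}")
```

Keeping the factors in one table makes the assumed unit definitions visible, which is the same transparency the exercise asks of dataset providers.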

Part 3: Research Case study: Assessing the availability and quality of data for tertiary sector energy demand forecast models

Large number of separate datasets
Divergence responsible for error of up to 100%

[Figure: Data sources (UK only)]

Results
Classification schemes

NACE (Tertiary)
Wholesale & Retail Trade; repair of motor vehicles and motorcycles
Accommodation and food service activities
Financial, insurance and real estate activities
Administrative and support service activities
Education
Human health and social work activities
Other NACE activities

ISIC (Commercial)
Wholesale and Retail Trade; Repair of Motor Vehicles, Motorcycles and Personal and Household Goods
Hotels and Restaurants
Real Estate, Renting and Business Activities
Post and telecommunication, Financial Intermediation
Education
Health
Miscellaneous
Public administration and defence
Agriculture, Forestry and Fishery (as separate sub sectors)

NACE: Nomenclature statistique des activités économiques dans la Communauté européenne (Eurostat, 2008)
ISIC: United Nations International Standard Industrial Classifications (UNIDO, 2010)

Results - Floor space in the sector
[Figure: panels "Entire Nondomestic stock", "All Commercial and Public buildings" and "Tertiary sector"; annotations "Questionable Difference"]

Results - Energy consumption in the sector
[Figure: values from the ISIC scheme and values from the NACE scheme; annotation "Declining Range"]

Results - Population

Results - Employee numbers in the sector
[Figure: values from the ISIC scheme and values from the NACE scheme; annotation "Declining Range"]
Same patterns as seen with the energy consumption data

Results - Gross Domestic Product
Clearly wrong (would this be obvious in isolation?)

Results - Gross value added
[Figure: values from the ISIC scheme and values from the NACE scheme]

Conclusions

Theory conclusions:
Data exists in many and varied forms
Handling and analysis skills will become increasingly important
There is a set of standard steps which should be followed in an initial exploration of any dataset
Foremost in your mind should be viewing a dataset critically
Visualisation is key to understanding
Graphs etc. are generally the best way of communicating information

Research Case Study Conclusions
Majority of error caused by lack of standard classification methodology
Semantic differences exist but can be resolved
Artefacts of harmonisation require care to eradicate
Lack of transparency is pervasive
Precision inextricably varies
Variables with associated established methodology can be relied upon
Many issues could be resolved through the setting up of a centralised repository

Data is dangerous

References:
Field, A. P. 2009. Discovering statistics using SPSS, SAGE Publications Ltd.
Witten, I. H. & Frank, E. 2005. Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann.
McCandless, D. 2009. Information is beautiful, Collins.
Tufte, E. R. & Howard, G. 1983. The visual display of quantitative information, Graphics Press, Cheshire, CT.
McKinsey. 2011. Big data: The next frontier for innovation, competition, and productivity. Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
Infrastructures, D. S. D. 2000. The SDI Cookbook. GSDI/Nebert. (for those interested in data infrastructure)

See also slide detailing data sources
