We recently applied this to a business problem that required an organization to quickly — and with no notice — stand up a website to collect hundreds, or potentially millions, of submissions from the general public. Our use case focused on law enforcement and the sorts of emergency response situations we’ve seen all too often in the news, such as the Boston Marathon bombing. When local, state, or federal authorities respond to criminal acts, they seek to quickly collect vast amounts of input from the public. This input can be in the form of tips, photos, videos, or countless other observations. Agencies need the capability to surge their IT tools and applications to collect the data, store it, and run analysis tools against the collected content to harvest information.
Solution Summary & Benefits
Our solution sought to achieve four objectives:
- Quickly provision a custom-branded website and publish it to the Internet within 15 minutes.
- Auto scale the environment based on public response.
- Allow the agency involved to leverage Big Data tools to find the most relevant and useful information.
- Minimize up-front capital investments.
The result is our Azure-powered proof of concept that allows non-IT staff to fill out a form and create a website that’s ready in under 15 minutes, auto scales on demand, leverages Big Data, and requires little upfront investment and no infrastructure costs. The site works on any device, supports any type of attachment, and can handle nearly “infinite load” without building out a massive IT infrastructure.
Note: Hosting this solution in your own datacenter would be like generating your own electricity – while it is certainly possible, it is far more expensive and less efficient than simply paying per kilowatt from an existing utility company that packages production, distribution and service. To ensure that we can utilize this approach for our clients, we built it out, loaded it with 25 million submissions, mashed it up with two other data sources, and created visual data analysis output in Excel.
How it Works
For our solution, we used Microsoft Windows Azure, along with Visual Studio and PowerShell. We selected Azure so we could leverage IaaS for storage and PaaS for the websites and Big Data components. By using PaaS (rather than IaaS), we do not need to patch or otherwise maintain our servers.
Users inside the agency log into the solution’s web portal, which is simply an ASP.NET Single Page Application (SPA) hosted as a Windows Azure Website. This website is designed to be left running at all times so it is ready when an event warrants its use. It’s a site with few users, little load (in fact, no load most days), and costs practically nothing to leave online. The user then completes a simple form that creates a website (including custom branding, if desired).
Once submitted, AIS-written code and PowerShell scripts create all the necessary resources on Azure. In about 15 minutes, two (or more) websites are created, load balanced across different data centers, and available to the public to use.
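A minimal sketch of what that provisioning automation might look like, here expressed in Python driving the Azure CLI. The actual solution used AIS-written code and PowerShell against the Azure APIs; the resource names, the two-region list, and the `dry_run` toggle below are hypothetical, for illustration only:

```python
# Sketch (not the production code): build the Azure CLI calls that would
# stand up one website per region plus a Traffic Manager profile in front.
import subprocess

def provision_commands(site_name, regions=("eastus", "westus")):
    """Return the az CLI commands for a load-balanced, multi-region site."""
    cmds = [["az", "group", "create",
             "--name", f"{site_name}-rg", "--location", regions[0]]]
    for region in regions:
        cmds.append(["az", "webapp", "create",
                     "--name", f"{site_name}-{region}",
                     "--resource-group", f"{site_name}-rg",
                     "--plan", f"{site_name}-plan-{region}"])
    # Performance routing sends each visitor to the closest datacenter.
    cmds.append(["az", "network", "traffic-manager", "profile", "create",
                 "--name", f"{site_name}-tm",
                 "--resource-group", f"{site_name}-rg",
                 "--routing-method", "Performance",
                 "--unique-dns-name", site_name])
    return cmds

def provision(site_name, dry_run=True):
    for cmd in provision_commands(site_name):
        if dry_run:
            print(" ".join(cmd))   # show what would run
        else:
            subprocess.run(cmd, check=True)
```

Driving the CLI from a script like this is what makes the 15-minute, form-triggered turnaround possible: no human has to click through a portal.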
The websites are like the portal site in that they are SPAs running as Windows Azure Websites on the PaaS offering. High availability and load balancing are achieved through Azure Traffic Manager, which routes users to the datacenter closest to them, or to an alternate location if a site becomes unavailable. As users submit data to the website, it is stored in Azure Blob storage and automatically replicated so that both datacenters contain the full set of data. At last check, the cost of this storage was eight cents per GB, per month!
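At that rate, a quick back-of-the-envelope calculation shows how cheap even a very large collection is to store. The two-megabyte average submission size below is an assumption for illustration, not a figure from the proof of concept:

```python
# Rough monthly Blob storage cost at the quoted $0.08 per GB per month.
# avg_mb_per_submission = 2 is an assumed average (photos push it up,
# text-only tips pull it down).
def monthly_storage_cost(submissions, avg_mb_per_submission=2, usd_per_gb=0.08):
    gb = submissions * avg_mb_per_submission / 1024
    return gb * usd_per_gb

# 25 million submissions comes to roughly $3,900 per month.
print(round(monthly_storage_cost(25_000_000), 2))
```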
To simulate load, we uploaded 25 million records to our proof-of-concept site. Even under this tremendous load, our solution worked as designed: Azure scaled up our resources, then reduced them back down once we cut off the loading process. With 25 million records collected, we were ready to start the real work: looking for the ones that matter amongst all the noise.
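The loading process can be sketched as a simple parallel submitter. The `submit()` function is stubbed out here; in the real run each call would have been an HTTP POST of the form fields and any attachment to the live endpoint:

```python
# Sketch of the load generator: fan submissions out across worker threads.
from concurrent.futures import ThreadPoolExecutor

def submit(record_id):
    # Stub: the real version posted a record to the public site.
    return record_id

def generate_load(total, workers=32):
    """Submit `total` records in parallel and return the success count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(submit, range(total)))
    return len(results)
```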
To search through this vast data collection quickly, we applied Azure’s HDInsight solution. We provisioned a 16-node cluster, using its parallel processing capabilities to execute a Hive query from Excel against our 25-million-item data collection in under six minutes. The query returned a few hundred thousand rows that we wanted to examine more closely and cross-reference with other data collections. At that point, we stopped our HDInsight cluster (to avoid paying for it while it was idle) and pointed our Excel workbook at two Open Government data sources on the Internet, provided by the FBI and the Census Bureau, to cross-reference our collected data with other datasets. Using Power View and Power Map in Excel, we quickly linked our collected data to the external data and put animated maps and pivot table capabilities in the hands of our analysts.
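As a toy illustration of that analysis step, the snippet below mimics the shape of the work in plain Python: a Hive-style filter-and-aggregate over submission metadata, then a join against an external dataset keyed by ZIP code. The field names and sample records are invented for illustration; the real query ran as Hive on the HDInsight cluster:

```python
# Toy stand-in for the Hive query plus the Excel mash-up with external data.
from collections import Counter

submissions = [
    {"zip": "02101", "has_photo": True},
    {"zip": "02101", "has_photo": False},
    {"zip": "02139", "has_photo": True},
]
census = {"02101": {"population": 4500}, "02139": {"population": 36000}}

# Roughly: SELECT zip, COUNT(*) FROM submissions WHERE has_photo GROUP BY zip
photo_counts = Counter(s["zip"] for s in submissions if s["has_photo"])

# Join the aggregate against the external dataset, as the Power View /
# Power Map mash-up did with the FBI and Census sources.
report = {z: {"photo_tips": n, "population": census[z]["population"]}
          for z, n in photo_counts.items()}
```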
Why it Matters
Being ready for a large-scale data collection website requirement that you know will come, but have no idea when, how often, or for how long, is a lot like planning for snow in southern states that rarely see more than an inch or two. Buying and maintaining a fleet of vehicles and equipment for the occasional foot of snow that comes every 20 years hardly makes sense. Likewise, planning, buying, and maintaining IT infrastructure to support surge capacity (or even normal capacity, for that matter) is no longer required.
For our clients, this solution ensures good stewardship of taxpayers’ dollars: they pay only for the computing capacity actually used, gain an agile capability to respond to world and local events, drastically reduce operations and maintenance costs, and shed the nearly impossible task of scaling out infrastructure on demand during periods of intense public scrutiny.