
Data Warehousing Interview questions


What is Data warehousing?

A data warehouse is a storage area where subject-specific, relevant data is stored irrespective of its source; data warehousing is everything required to create and maintain such a warehouse. It merges data from multiple sources into a single, consistent, complete form that is used for reporting and analysis. The underlying idea is to provide an architected flow of data from operational systems to decision-support environments.

What are fact tables and dimension tables?
As mentioned, data in a warehouse comes from transactions. A fact table consists of the facts and/or measures; the data in a fact table is usually numeric. A dimension table, on the other hand, contains fields that describe the data in the fact tables: it provides the additional, descriptive information (the dimension) for a field of a fact table.
E.g. if I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my dimension table will store the task and resource details. Hence, the relation between a dimension table and a fact table is one to many.
Business facts (measures) are persisted in the fact table together with foreign keys that refer to candidate keys in the dimension tables. Fact tables usually supply the additive values that are analyzed by the dimensional attributes, while attributes used to constrain and group data in warehousing queries are persisted in the dimension tables.
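A minimal sketch of this relationship in SQL (the table and column names here are illustrative, not from any particular system):

-- Dimension table: descriptive attributes of a task.
CREATE TABLE dim_task (
    task_key     INTEGER PRIMARY KEY,   -- surrogate key
    task_name    VARCHAR(100),
    project_name VARCHAR(100)
);

-- Fact table: numeric measures plus foreign keys into the dimensions.
CREATE TABLE fact_resource_usage (
    task_key       INTEGER REFERENCES dim_task (task_key),
    usage_date     DATE,
    resources_used INTEGER               -- the measure
);

Many fact rows can reference the same dim_task row, which is the one-to-many relation described above.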

What is ETL process in data warehousing?
ETL stands for Extract, Transform, Load. It is the process of fetching data from different sources (such as flat files, databases, or XML data), transforming that data into a consistent, clean form suited to the application's needs, and loading it into the data warehouse. Several tools are available in the market to perform ETL jobs.
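As a small hedged illustration, once data has been extracted into a staging table, the transform and load steps can be plain SQL (stg_customer and dim_customer are hypothetical tables):

-- Transform staged rows into a clean, consistent form and load them.
INSERT INTO dim_customer (customer_key, customer_name, city)
SELECT customer_id,
       TRIM(UPPER(customer_name)),   -- transform: normalize names
       COALESCE(city, 'UNKNOWN')     -- transform: handle missing values
FROM   stg_customer
WHERE  customer_id IS NOT NULL;      -- clean: drop unusable rows

Real ETL tools add scheduling, error handling, and lineage tracking on top of steps like this.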


Explain the difference between data mining and data warehousing.
Data warehousing is extracting data from different sources, cleaning it, and storing it in the warehouse: the central repository for the data of several business systems in an enterprise, where data from various sources is selectively extracted and organized for analysis and accessibility. Data mining, by contrast, aims to examine or explore that data using queries, which are typically fired against the data warehouse. It sifts through large data repositories with pattern-recognition techniques to find correlations and patterns, and it is commonly used for modeling and forecasting; exploring the data this way helps in reporting, planning strategies, finding meaningful patterns, etc.
E.g. a company's data warehouse stores all the relevant information about projects and employees; using data mining, one can work on this data to generate reports such as profits generated.

What is an OLTP system and OLAP system?
OLTP: Online Transaction Processing helps manage applications based on transactions involving high volumes of data. Typical examples of such transactions occur in banking and air ticketing. Because OLTP uses client-server architecture, it supports transactions that run across a network.
OLAP: Online Analytical Processing performs analysis of business data and provides the ability to perform complex calculations, usually on lower volumes of data. OLAP helps the user gain insight into data coming from different sources (multidimensional data).

What are cubes?
A data cube stores data in a summarized form, which allows faster analysis and easy reporting. Cubes are the logical representation of multidimensional data in a warehouse: the edges of the cube represent the dimensions and the body of the cube holds the data values. OLAP environments view data as such hierarchical cubes, and a cube typically includes the aggregations needed by business-intelligence queries.
E.g. a user may want to analyze the weekly and monthly performance of an employee; here, month and week can be considered dimensions of the cube.
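Many relational engines can produce cube-style summaries directly; for example, the standard GROUP BY CUBE clause (supported by Oracle, SQL Server, PostgreSQL, and others; the table below is hypothetical) aggregates along every combination of the listed dimensions:

-- Totals by month, by week, by month+week, and a grand total in one pass.
SELECT perf_month,
       perf_week,
       SUM(tasks_completed) AS total_tasks
FROM   fact_employee_performance
GROUP  BY CUBE (perf_month, perf_week);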

What is snow flake scheme design in database?
A snowflake schema, in its simplest form, is an arrangement of fact tables and dimension tables with the fact table at the center, surrounded by the dimension tables; unlike a star schema, the dimension tables are further broken down into more dimension tables. It is one of the dimensional-modeling designs used in data warehousing: because the dimension tables split into sub-dimension tables, the schema leans toward normalization and involves deeper chains of joins.
E.g. dimension tables include employee, project, and status; the status table can be further broken into status_weekly and status_monthly, as sketched below.
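A minimal SQL sketch of snowflaking, assuming a hypothetical status dimension: the attribute that would sit inside the dimension in a star schema is normalized out into its own table.

-- Sub-dimension split out of the status dimension.
CREATE TABLE dim_status_type (
    status_type_key INTEGER PRIMARY KEY,
    frequency       VARCHAR(10)       -- e.g. 'WEEKLY', 'MONTHLY'
);

-- The dimension now joins to its sub-dimension instead of storing the attribute.
CREATE TABLE dim_status (
    status_key      INTEGER PRIMARY KEY,
    status_name     VARCHAR(50),
    status_type_key INTEGER REFERENCES dim_status_type (status_type_key)
);

Queries that need the frequency attribute must now join one level deeper, which is the "joins in depth" trade-off mentioned above.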

What is analysis service?
Analysis services provide an integrated, combined view of the business data used in OLAP and data mining; "services" here refers to the OLAP and data-mining functionality. Analysis Services lets the user apply a wide variety of data-mining algorithms for creating and designing data-mining models.
Explain sequence clustering algorithm.
A sequence clustering algorithm groups together similar or related paths: sequences of data containing events.
E.g. a sequence clustering algorithm may help find the path for storing products of a "similar" nature in a retail warehouse.
Explain discrete and continuous data in data mining.
Discrete data is defined, finite data, e.g. employee id, mobile number, gender, address.
Continuous data is data that changes continually, in an ordered fashion, e.g. age, salary, experience in years.

Explain time series algorithm in data mining.
A time series algorithm is used to predict continuous values of data. Once the algorithm has been trained to predict a series of data, it can predict the outcome of other series.
E.g. the past performance of an employee can be used to forecast the profit.

What is XMLA?
XMLA stands for XML for Analysis. It is an industry standard for accessing data in analytical systems — OLAP, data mining, and similar data sources — over the internet, and it is based on XML, SOAP, and HTTP. XMLA uses two methods, Discover and Execute: Discover fetches information from the data source, while Execute lets applications run commands against it.
Explain the difference between Data warehousing and Business Intelligence.
Data warehousing helps you store the data, while business intelligence helps you use that data for decision making, forecasting, and so on. Data warehousing deals with all aspects of managing the warehouse — its development, implementation, and operation — along with metadata management, data cleansing, data transformation, data acquisition, persistence management, and data archiving; using ETL jobs, it stores data in a meaningful form.
Business intelligence is what the organization uses to analyze and measure aspects of the business such as sales, marketing campaign effectiveness, operational efficiency, profitability, and market penetration within customer groups. It typically encompasses OLAP, data visualization, data mining, and reporting tools (Cognos, SAS, and the like), which exist precisely to query the warehouse for reporting and forecasting.

What is Dimensional Modeling?
Dimensional modeling (DM) is one of the logical design techniques used to build data warehouses: a rational, consistent design approach that uses the facts and dimensions of a warehouse, typically represented as a star or snowflake schema. It differs from the entity-relationship model: it does not necessarily involve a relational database, and the logical model can be realized in physical forms such as database tables or flat files. Mainly used by data warehouse designers, it represents data in a standard, consistent manner that allows high-performance access, and it is oriented around understandability and support for end-user queries, in contrast to database administration.

What is surrogate key? Explain it with an example.
Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is not generated by the user or derived from the application data; it is generated internally by the system — for example, a sequential number — and is normally invisible to the user. The difference from the source's primary key is that the primary key uniquely identifies a record by its business content, while the surrogate key uniquely identifies the entity in the warehouse independently of that content.
E.g. one employee may be recruited before the year 2000 while another employee with the same name is recruited after the year 2000. The source primary key identifies each record, while the surrogate key (say a serial number) is generated by the system, since it is NOT derived from the data.
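A sketch of a system-generated surrogate key using the SQL-standard identity syntax (supported, with variations, by Oracle 12c+, PostgreSQL, and DB2; column names are illustrative):

CREATE TABLE dim_employee (
    employee_key  INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- surrogate key
    employee_id   VARCHAR(20),    -- natural/business key from the source system
    employee_name VARCHAR(100),
    hire_year     INTEGER
);

Two employees with the same name simply receive different employee_key values, since the key is not derived from the data.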

What is the purpose of Factless Fact Table?
Factless fact tables are so called because they simply contain keys referring to the dimension tables: they hold no numeric, aggregatable measures of their own. They are commonly used for tracking processes or collecting status, where the mere presence of a row records that an event occurred.
E.g. finding the number of leaves taken by an employee in a month.
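A hedged sketch of a factless fact table and the kind of counting query it supports (dim_date and dim_employee are assumed to exist):

-- Only keys, no numeric measures: each row records one leave event.
CREATE TABLE fact_employee_leave (
    employee_key INTEGER REFERENCES dim_employee (employee_key),
    date_key     INTEGER REFERENCES dim_date (date_key)
);

-- The "fact" is obtained by counting rows, e.g. leaves per employee per month.
SELECT e.employee_name, d.calendar_month, COUNT(*) AS leaves_taken
FROM   fact_employee_leave f
JOIN   dim_employee e ON e.employee_key = f.employee_key
JOIN   dim_date     d ON d.date_key     = f.date_key
GROUP  BY e.employee_name, d.calendar_month;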

What is a level of Granularity of a fact table?
Granularity is the lowest level of information stored in the fact table; it describes the depth of the data. A fact table is usually designed at a low level of granularity, which means finding the lowest level of information that the fact table can usefully store.
E.g. employee performance is a very high level of granularity, while employee_performance_daily and employee_performance_weekly are lower levels. In a date dimension, the level of granularity could be year, quarter, period, month, week, or day.
Choosing the granularity consists of two steps:
- determining which dimensions are to be included, and
- determining where, in the hierarchy of each dimension, the information is to be placed.
Both determinations depend on the requirements.

Explain the difference between star and snowflake schemas.
A star schema is a highly denormalized design: one fact table at the center is associated with a number of dimension tables, depicting a star. A snowflake schema is a star schema with normalization principles applied: the dimension tables are further broken down into sub-dimension tables. A snowflake schema design is therefore usually more complex, but its data is more normalized and standardized than in a star schema.
Differences:
- In a star schema a dimension table has no parent table, whereas in a snowflake schema dimension tables have one or more parent tables.
- In a star schema the hierarchies of a dimension are held inside the dimension table itself, whereas in a snowflake schema the hierarchies are split into different tables, and data can be drilled down from the topmost hierarchy to the lowermost.
- In a star schema each dimension table is independent, without sub-dimensions, whereas in a snowflake schema one dimension table can have multiple sub-dimensions.
E.g. star schema: performance_report is the fact table, with dimension tables such as performance_report_employee and performance_report_manager. Snowflake schema: those dimension tables are broken down further, e.g. performance_report_employee_weekly, _monthly, etc.

What is the difference between view and materialized view?
A view is created by combining data from different tables; it holds no data of its own. When a view is created, nothing is stored in the database: the data is produced only when a query is fired on the view. A view is a logical structure, occupies no space, and changes in the underlying tables are reflected in it.
A materialized view, commonly used in data warehousing, does hold data: its result set is pre-calculated by queries and physically stored, which helps in decision making and expensive calculations. It occupies space, and changes in the underlying tables are not reflected in it until it is refreshed.
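The contrast as a SQL sketch (CREATE MATERIALIZED VIEW is Oracle/PostgreSQL syntax; SQL Server achieves the same with indexed views; fact_sales is hypothetical):

-- Plain view: stores no data; the query runs each time the view is read.
CREATE VIEW v_monthly_sales AS
SELECT product_key, sales_month, SUM(amount) AS total_amount
FROM   fact_sales
GROUP  BY product_key, sales_month;

-- Materialized view: the result is computed once and physically stored,
-- and must be refreshed to pick up changes in fact_sales.
CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT product_key, sales_month, SUM(amount) AS total_amount
FROM   fact_sales
GROUP  BY product_key, sales_month;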

What is a Cube and Linked Cube with reference to data warehouse?
A data cube stores data in a summarized form for faster analysis: it is the logical representation of multidimensional data, with dimension members on the edges of the cube and data values in its body. A linked cube is a cube stored on another analysis server that draws on an existing cube's data; linking cubes keeps the shared data consistent and reduces the possibility of sparse data.
E.g. a data cube may store employee_performance; to analyze the hours from which that performance was calculated, one can create another cube linked to the root cube (in this case, employee_performance).

What is junk dimension?
When certain attributes are not appropriate to store in the main schema tables, they can be stored in a junk dimension: a single dimension formed by lumping together a number of small, unrelated attributes, typically Boolean or flag values and short text. Random flags and text attributes are grouped by moving them into this one distinguished dimension, keeping them out of the fact table.
E.g. "was the performance of the employee up to the mark?", or a short comment category on the performance.
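A minimal sketch of a junk dimension for the performance example (names are illustrative):

-- Unrelated flags lumped into one small dimension.
CREATE TABLE dim_performance_junk (
    junk_key         INTEGER PRIMARY KEY,
    met_expectations CHAR(1),      -- 'Y' / 'N'
    has_comments     CHAR(1),      -- 'Y' / 'N'
    comment_category VARCHAR(30)
);

-- The fact table then carries one junk_key instead of several flag columns.

The dimension is usually pre-populated with every valid combination of the flags.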

What are fundamental stages of Data Warehousing?
The fundamental stages describe how a data warehouse, and the data in it, evolves.
Offline operational databases: the initial stage, in which the data of the operational systems is merely copied to an off-line server; even if the copied data is processed for reporting, the source data's performance is not affected.
Offline data warehouse: the warehouse is updated from the operational systems on a regular time cycle, and the data is persisted in a reporting-oriented structure.
Real-time data warehouse: the warehouse is updated for every transaction or event performed on the source data (e.g. booking a ticket).
Integrated data warehouse: the warehouse not only updates data as transactions are performed, but also generates transactions that are passed back into the operational source systems, where they are used in the daily activity of the organization.

What is Virtual Data Warehousing?
A virtual data warehouse provides a compact, aggregate view of the complete data inventory without physically copying the data; it holds no historic data of its own. It contains metadata, which is used to form a logical enterprise data model over the underlying databases of record: legacy database systems are published with their metadata extracted, middleware builds the connections to the different data sources, and (in one common infrastructure) standards such as JEE, JMS, and EJBs serve transactional unit requests while extract-transform-load tools handle real-time bulk data loading. Virtual warehouses can be fast because they let users filter out just the most important pieces of data from different legacy applications.

What is active data warehousing?
An active data warehouse captures transactional data continuously and delivers real-time data, representing the current state of the business. The repository can be used to find trends and patterns for future decision making; it considers the analytic perspectives of customers and suppliers, helps deliver updated data through reports, and provides a single integrated view of a customer across multiple business lines. It is closely associated with business-intelligence systems.


List down differences between dependent data warehouse and independent data warehouse.
A dependent data warehouse is built on an operational data store (ODS) and stores its data in a central data warehouse, whereas an independent data warehouse depends on neither an ODS nor a central data warehouse.

What is data modeling and data mining? What is this used for?
Data modeling is designing a model for the data or database: it defines and analyzes the requirements of the data that supports an organization's business processes, identifying all entities that hold data and the relationships among them. Data models can be conceptual, logical, or physical: conceptual models explore high-level business concepts with stakeholders, logical models explore domain concepts, and physical models explore the database design. In a warehouse model, transaction data is persisted in fact tables and master data in dimension tables.
Data mining is the process of obtaining the hidden trends in data, transforming them into usable information. It examines or explores the data using queries, typically fired against the data warehouse, and helps in reporting, planning strategies, and finding meaningful patterns; it can convert a large amount of data into a sensible form. Data mining is used in a wide range of profiles such as marketing, surveillance, and fraud detection, and especially in retail, consumer-goods, telecommunication, and financial organizations with a strong consumer orientation, to determine the impact on sales, customer satisfaction, and profitability.

Difference between ER Modeling and Dimensional Modeling.
The ER model is used for OLTP databases and contains normalized data (1st, 2nd, or 3rd normal form), whereas the dimensional model is used for data warehousing and contains denormalized data. The dimensional model is very flexible from the user's perspective and maps directly onto warehouse schemas (star, snowflake), which makes it good for ad hoc query analysis; the ER model is not mapped for creating such schemas and is better suited to reporting and point queries. An ER diagram represents the entire business or application process and has both a logical and a physical model; it can be segregated into multiple dimensional models, each of which has only a physical model.


What is snapshot with reference to data warehouse?
A snapshot in a data warehouse is a complete visualization of the data at a point in time: a report persisted to a file after being disconnected from the catalogue. Snapshots occupy little space, can be used to back up and restore data quickly, and can track activities. For example, every time an employee attempts to change his address, the warehouse can take a snapshot; that is, each snapshot is taken when some event fires.
A snapshot has three components:
- the time when the event occurred,
- a key to identify the snapshot,
- the data that relates to the key.

What is degenerate dimension table?
A degenerate dimension is a dimension persisted in the fact table itself: a column that is part of the fact table but does not map to any dimension table and has no dimension table of its own. Its values are neither measures nor foreign keys into other dimensions.
E.g. employee_id carried in a fact table that has no corresponding employee dimension.
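A sketch using the classic order-number illustration of a degenerate dimension (table names are hypothetical):

CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date (date_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    order_number VARCHAR(20),     -- degenerate dimension: no dim_order table exists
    amount       DECIMAL(12, 2)
);

order_number can still be used to group or filter facts, even though it maps to no dimension table.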

What is Data Mart?
A data mart is a data repository that serves a particular community of knowledge workers. It stores specific data, gathered from different sources — enterprise resources or a data warehouse — that belongs to a particular community or business area, so it can focus on specific business needs.

What is the difference between metadata and data dictionary?
Metadata is "data about data": data that describes other data. It holds information about how, when, and by whom a certain set of data was collected, and about the data's format. It is essential for understanding the information stored in data warehouses and in XML-based web applications.
A data dictionary is a file containing the basic definitions of a database: the list of files in the database, the number of records in each file, and information about the fields. The data dictionary can therefore be seen as metadata that describes the database.

Describe the various methods of loading Dimension tables.
The following are the methods of loading dimension tables:
Conventional load: all the table constraints are checked against the data before it is loaded.
Direct (or faster) load: as the name suggests, the data is loaded directly without checking the constraints; the data is checked against the table constraints afterwards, and bad data is not indexed. An Oracle-flavored sketch of both styles follows.
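For example, in Oracle the two styles look like this (stg_product and dim_product are hypothetical; the APPEND hint is Oracle-specific):

-- Conventional load: constraints and indexes are maintained row by row.
INSERT INTO dim_product (product_key, product_name)
SELECT product_id, product_name FROM stg_product;

-- Direct-path (faster) load: data is written directly into the table segment,
-- bypassing much of the conventional per-row checking.
INSERT /*+ APPEND */ INTO dim_product (product_key, product_name)
SELECT product_id, product_name FROM stg_product;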

What is the difference between OLAP and data warehouse?
A data warehouse is a repository that stores data from different sources in a relational database for end-use analysis; it serves as the store of historical data. Its data is organized in summarized, aggregated, non-volatile, subject-oriented patterns, and it supports the analysis of data but does not itself provide online analysis.
OLAP (Online Analytical Processing) is what analyzes and evaluates the data in the warehouse, using analytical queries. It organizes the data with multidimensional models, using aggregation and summarization, and gives the data analyst speed and flexibility for online data analysis in a real-time environment.

Describe the foreign key columns in fact table and dimension table.
The foreign keys of dimension tables are the primary keys of the source entity tables; the foreign keys of fact tables are the primary keys of the dimension tables. In other words, a fact table's foreign keys reference the dimension tables, while a dimension table is itself a referenced table, carrying foreign-key references from one or more fact tables.

What is cube grouping?
A cube group is a transformer-built set of similar cubes. Each cube group is related to a single level in one dimension of the model; cube groups are generally used to create smaller cubes based on the data at that level of the dimension.

Define the term slowly changing dimensions (SCD).
Slowly changing dimensions are dimensions whose data changes very slowly: the SCD technique is applied when an attribute of a record varies over time. An example is the city of an employee, a value that changes rarely. When a change occurs, the row in the dimension can be handled in several ways, sketched below: it can be replaced completely with no track of the old record, a new row can be inserted alongside the old one, or the change can be tracked in place. (In SQL warehousing, an SCD target operator is one of the operators that can be used in a mining flow or a data flow.)
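A hedged sketch of the two most common approaches, assuming the dimension carries valid_from/valid_to columns for history (all names are illustrative):

-- SCD Type 1: overwrite in place; no history of the old city is kept.
UPDATE dim_employee
SET    city = 'Pune'
WHERE  employee_id = 'E1001';

-- SCD Type 2: close the current row, then insert a new row, keeping history.
UPDATE dim_employee
SET    valid_to = CURRENT_DATE
WHERE  employee_id = 'E1001' AND valid_to IS NULL;

INSERT INTO dim_employee (employee_id, city, valid_from, valid_to)
VALUES ('E1001', 'Pune', CURRENT_DATE, NULL);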

What is a Star Schema?
The star schema is the simplest data warehousing schema; it can be considered the special case of the snowflake schema with no sub-dimensions. A star schema comprises a fact table that references any number of dimension tables: the fact table contains the facts — the actual, usually numeric, data stored in many rows across multiple columns — and carries foreign-key references to the dimension tables, while the dimension tables contain the descriptive attributes at a smaller granularity.
Differences between star and snowflake schema:
Star schema: a denormalized design in which one fact table is associated with several dimension tables, resembling a star.
Snowflake schema: a star schema with normalization principles applied, in which every dimension table may be associated with sub-dimension tables.


Explain the use of lookup tables and Aggregate tables.
A lookup table is used when updating the data warehouse: matched against the fact table or warehouse on the primary key of the target, it lets the update proceed only for new or changed records, depending on the lookup condition.
Aggregate tables are materialized views that contain summarized data. For example, to generate sales reports on a weekly, monthly, or yearly basis instead of daily, the date values are aggregated into week values, week values into month values, and month values into year values, using aggregate functions.
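A sketch of building such an aggregate table (CREATE TABLE ... AS is PostgreSQL/Oracle style; fact_sales_daily is hypothetical):

-- Daily sales rolled up to month, computed once and reused by reports.
CREATE TABLE agg_sales_monthly AS
SELECT product_key,
       EXTRACT(YEAR FROM sale_date)  AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount)                   AS total_amount
FROM   fact_sales_daily
GROUP  BY product_key,
          EXTRACT(YEAR FROM sale_date),
          EXTRACT(MONTH FROM sale_date);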

What is real time data-warehousing?
Real-time data warehousing is the combination of real-time activity — activity happening at the current moment — and data warehousing: business-activity data is captured as it occurs, and as soon as the activity completes the data flows into the warehouse, where it becomes available instantly. In other words, the warehouse is updated every time the system performs a transaction, so it reflects the business's real-time information: when a query is fired against the warehouse, it returns the state of the business at that moment. Real-time data warehousing can thus be viewed as a framework for retrieving information from data the moment the data becomes available.

What is conformed fact? What is conformed dimensions use for?
Conformed facts are facts that are allowed to have the same name in different tables, so that they can be combined and compared mathematically.
A conformed dimension is a dimension table used by more than one fact table; it can be used across multiple data marts in combination with multiple fact tables. Conformed dimensions have a fixed structure, so the facts in an application can be used without changing the dimension tables' metadata or making further modifications.

Define non-additive facts.
Non-additive facts are facts that cannot be summed up over any of the dimensions present in the fact table: adding these columns produces no meaningful result. For example, profit margin is a non-additive fact, since it has no meaning to add margins up at the account level or the day level. (Such facts can still be useful when analyzed against changes in the dimensions.)
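In SQL terms, with a hypothetical fact_sales table, the point is that the ratio must be re-derived from its additive components rather than summed:

-- Meaningless: margin percentages do not add across rows.
--   SELECT SUM(margin_pct) FROM fact_sales;

-- Meaningful: recompute the ratio from additive facts.
SELECT SUM(profit) / SUM(revenue) AS overall_margin
FROM   fact_sales;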

Define BUS Schema.
A BUS schema identifies the common dimensions across business processes, i.e. the conformed dimensions. A BUS schema consists of conformed dimensions and standardized definitions of facts.
List out difference between SAS tool and other tools.
The differences between SAS and other tools are:
- SAS is a reporting tool, an ETL tool, and a forecasting tool in one.
- Other tools are either a reporting tool (for example, Cognos), an ETL tool (for example, Informatica), or both (for example, Business Objects), but they have no forecasting component; for this reason SAS is used most in clinical trials and the health-care industry.
- SAS provides more features in comparison to other tools: it supports almost ALL database interfaces and has its own extensive database engine.

Why is SAS so popular?
SAS (Statistical Analysis System) is an integrated set of software products that allows developers to perform:
- data entry, data retrieval, data management, and data mining;
- report writing, with support for graphics;
- statistical analysis, business planning, business forecasting, and business decision support;
- operations research, project management, quality improvement, and application development;
- extract, transform, and load (ETL) functions in data warehousing;
- platform-independent and remote computing.
Because of these many features — it is not just an ETL tool, but can also be used for reporting and for forecasting business needs — SAS has become more and more popular.

What is data cleaning? How can we do that?
Data cleaning, also known as data scrubbing, is the process of identifying erroneous data and ensuring that a set of data is correct and accurate. Accuracy, consistency, and integration are checked during data cleaning, whether for a single set of records or for multiple sets of data that need to be merged. Cleaning is performed by reading all the records in a set and verifying their accuracy: typos and spelling errors are rectified, mislabeled data is relabeled and filed, incomplete or missing entries are completed, and unrecoverable records are purged so that they neither take up space nor cause inefficient operations.
Methods:
Parsing - used to detect syntax errors.
Data transformation - confirms that the input data matches the expected format.
Duplicate elimination - gets rid of duplicate entries (sketched below).
Statistical methods - mean, standard deviation, range, clustering algorithms, etc. are used to find erroneous data.
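As one concrete, hedged example, duplicate elimination can be done with a window function (stg_customer and its load_timestamp column are hypothetical):

-- Keep the most recently loaded row per customer_id; discard the rest.
CREATE TABLE stg_customer_clean AS
SELECT customer_id, customer_name, city
FROM  (SELECT s.*,
              ROW_NUMBER() OVER (PARTITION BY customer_id
                                 ORDER BY load_timestamp DESC) AS rn
       FROM   stg_customer s) t
WHERE  rn = 1;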

Explain in brief about critical column.
A critical column is a column (usually granular) whose value changes over a period of time.
For example, a customer named "Anirudh" resided in Bangalore for 4 years, making Rs. 30 lakhs worth of purchases there, and then shifted to Pune. If the CITY value in the warehouse is simply overwritten, his purchases are now shown under Pune only, making the warehouse inconsistent: CITY is the critical column. The same happens whenever the warehouse tracks, say, per-day expenses by city; once the user changes city, the expenses appear under the new city. A surrogate key, which lets each version of the record be kept, can be used as a solution for this.


What is data cube technology used for?
A data cube is a multidimensional structure: a data abstraction for viewing aggregated data from a number of perspectives. One attribute is aggregated as the "measure" attribute, while the remaining dimensions act as the "feature" attributes. Data cubes are commonly used for easy interpretation of data: they represent the measures of the business along dimensions, each dimension of the cube representing some attribute of the database (e.g. sales per day, per month, or per year). Viewing aggregated, summarized facts multidimensionally in this way is exactly where OLAP plays its role.

What is Data Scheme?
A data scheme is a diagrammatic representation that illustrates the data structures and the relationships among them in the relational database within the data warehouse. The data structures have their names defined along with their data types. Data schemes are handy guides for database and data-warehouse implementation and are useful in troubleshooting databases. A data scheme may or may not represent the real layout of the database: it is a structural representation of the physical database.

What is Bit Mapped Index?
Bitmap indexes make use of bit arrays (bitmaps) to answer queries by performing bitwise logical operations. They work well with data of lower cardinality — data that takes few distinct values — and are therefore useful in data warehousing applications, where they have a significant space and performance advantage over other structures for such data. Tables with few insert or update operations are good candidates.
The advantages of bitmap indexes are:
- They have a highly compressed structure, making them fast to read.
- Their structure makes it possible for the system to combine multiple indexes, so the underlying table can be accessed faster.
The disadvantage of bitmap indexes is:
- The overhead of maintaining them is enormous.
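Creating one is a single statement in Oracle, the database most associated with bitmap indexes (dim_customer is hypothetical):

-- A bitmap index on a low-cardinality column.
CREATE BITMAP INDEX idx_customer_gender
    ON dim_customer (gender);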

What is Bi-directional Extract?
In hierarchical, networked, or relational databases, data can be extracted, cleansed, and transferred in two directions; the ability of a system to do this is referred to as bidirectional extract. This functionality is extremely useful in data warehousing projects.
Data extraction
The source systems from which data is extracted vary in many ways, from their structures and file formats to the department and business segment they belong to. Common source formats include flat files and relational databases, as well as non-relational database structures such as IMS, VSAM, or ISAM.
Data transformation
The extracted data may undergo transformation, with possible addition of metadata, before being exported to another large storage area. In the transformation phase, various functions related to business needs, requirements, rules, and policies are applied; some values get translated and encoded, and care is taken to avoid redundancy in the data.
Data cleansing
In data cleansing, incorrect or corrupted data is scrutinized and the inaccuracies are removed, ensuring data consistency. It involves activities like
- removing typographical errors and inconsistencies
- comparing and validating data entries against a list of entities
Data loading
This is the last step of a bidirectional extract: the cleansed, transformed source data is loaded into the data warehouse.
Advantages
- Updates and data loading become very fast due to bidirectional extracting.
- As timely updates are received in a useful pattern, companies can make good use of the data to launch new products and formulate market strategies.
Disadvantages
- More investment in advanced, faster IT infrastructure.
- A lack of fault tolerance may mean unexpected stoppage of operations when the system breaks.
- Skilled data administrators need to be hired to manage the complex process.

What is Data Collection Frequency?
Data collection frequency is the rate at which data is collected. The data is not just collected and stored, however; it goes through various stages of processing — extraction from various sources, cleansing, transformation — before being stored in useful patterns. It is important to keep a record of the rate at which data is collected, for several reasons:
- Companies can use these records to keep track of the transactions that have occurred, and to know whether any invalid transactions ever took place.
- In rapidly changing markets, companies need very frequently updated data to make decisions based on the current state of the market and invest appropriately.
- Companies that keep launching new products update their records frequently so that customers can see them, which in turn increases business.
- When a data warehouse faces technical problems, the logs together with the data collection frequency can be used to determine the time and cause of the problem.
- With real-time data collection, database managers and data-warehouse specialists can make more room for recording data collection frequency.

What is Data Cardinality?
Cardinality is the term used in database relations to denote the occurrence of data on either side of a relation; for a column, it denotes how distinct the column's values are.
There are 3 basic types of data cardinality, checked with a simple query below:
High data cardinality: the values of the column are mostly unique.
e.g.: email ids and user names
Normal data cardinality: the values of the column are somewhat uncommon but not unique.
e.g.: a column containing LAST_NAME (there may be several entries with the same last name)
Low data cardinality: the values of the column are very common.
e.g.: flag statuses: 0/1
Determining cardinality is a substantial aspect of data modeling, as it is used to determine the relationships between entities.
Types of relationship cardinalities:
The link cardinality - 0:0 relationship
The sub-type cardinality - 1:0 relationship
The physical segment cardinality - 1:1 relationship
The possession cardinality - 0:M relationship
The child cardinality - 1:M mandatory relationship
The characteristic cardinality - 0:M relationship
The paradox cardinality - 1:M relationship
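Column cardinality can be estimated by counting distinct values (dim_customer is hypothetical):

SELECT COUNT(DISTINCT gender)    AS gender_values,     -- low cardinality
       COUNT(DISTINCT last_name) AS last_name_values,  -- normal cardinality
       COUNT(DISTINCT email)     AS email_values       -- high cardinality
FROM   dim_customer;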

What is Chained Data Replication?
In chained data replication, a non-official data set is distributed among many disks, providing load balancing among the servers within the data warehouse. Blocks of data are spread across clusters, each cluster holding a complete set of the replicated data, and every data block in every cluster is a unique permutation of the data in the other clusters. When a disk fails, all calls to its data are redirected to the other disks holding the replicas. Replicas and disks can at times be added online without moving the data in the existing copies or affecting the disks' arm movement. For load balancing, multiple servers within the warehouse share data-request processing, since the data already has replicas on each server's disks.

What are Critical Success Factors?
Critical success factors (CSFs) are the key areas of activity in which favorable results are necessary for a company to reach its goal.
There are four basic types of CSFs which are:
Industry CSFs
Strategy CSFs
Environmental CSFs
Temporal CSFs
A few CSFs are:
Money
Your future
Customer satisfaction
Quality
Product or service development
Intellectual capital
Strategic relationships
Employee attraction and retention
Sustainability
The advantages of identifying CSFs are:
they are simple to understand;
they help focus attention on major concerns;
they are easy to communicate to coworkers;
they are easy to monitor;
and they can be used in concert with strategic planning methodologies.
