Q. What is a Data Warehouse?
A1.
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated, which makes it much easier and more efficient to run queries over data that originally came from different sources. Another definition: "A data warehouse is a logical collection of information gathered from many different operational databases, used to create business intelligence that supports business analysis activities and decision-making tasks. It is primarily a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP)." Generally, data warehousing is not meant for current "live" data, although 'virtual' or 'point-to-point' data warehouses can access operational data. A 'real' data warehouse is generally preferred to a virtual DW because the stored data has been validated and is set up to provide reliable results to the common types of queries used in a business.
A2.
Typical relational databases are designed for on-line transaction processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.
Q. What is ODS?
1. ODS stands for Operational Data Store.
2. An ODS is a collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse, to ensure that summarized and derived data are calculated properly. The ODS may further become the enterprise's shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.
Q. What is a dimension table?
A.
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains mainly textual (descriptive) attributes.
Q. What is a lookup table?
A.
A lookup table is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse) based on the primary key of the target, it updates the table by allowing through only new records, or updated records, that satisfy the lookup condition.
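For illustration, a minimal upsert sketch in ANSI-style SQL (table and column names are assumed, not from any specific system) that applies the lookup on the target's primary key and lets only new or changed records through:

-- Hypothetical upsert into a warehouse dimension based on a key lookup.
MERGE INTO dim_customer tgt
USING stg_customer src
    ON (tgt.customer_id = src.customer_id)        -- lookup on the target's key
WHEN MATCHED THEN
    UPDATE SET tgt.customer_name = src.customer_name,
               tgt.city          = src.city
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, city)
    VALUES (src.customer_id, src.customer_name, src.city);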
Q. Why should you put your data warehouse on a different system than your OLTP system?
A1.
An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional model); that is why we design a separate, subject-oriented OLAP system.
Moreover, if a complex query is fired on an OLTP system, it will cause a heavy overhead on the OLTP server and directly affect the day-to-day business.
A2.
The loading of a warehouse will likely consume a lot of machine resources. Additionally, users may create queries or reports that are very resource-intensive because of the potentially large amount of data available. Such loads and resource needs will conflict with the needs of the OLTP systems and will negatively impact those production systems.
Q. What are Aggregate tables?
A.
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of the dimensions. Retrieving the required data from the detailed fact table, which may have millions of records, takes more time and also affects server performance. To avoid this, we can aggregate the table to the required level and use it. These tables reduce the load on the database server, improve query performance, and return results very quickly.
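As a sketch (assumed table and column names; CREATE TABLE ... AS SELECT syntax varies by database), an aggregate table can be built by grouping the detailed fact table to a coarser level, for example monthly sales by product:

-- Illustrative aggregate table: monthly sales per product,
-- summarized from a transaction-level fact table.
CREATE TABLE agg_sales_monthly AS
SELECT d.year_num,
       d.month_num,
       f.product_key,
       SUM(f.sales_amount) AS total_sales,
       COUNT(*)            AS txn_count
FROM   fact_sales f
JOIN   dim_date   d ON d.date_key = f.date_key
GROUP BY d.year_num, d.month_num, f.product_key;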
Q. What is Dimensional Modeling? Why is it important?
A.
Dimensional Modeling is a design concept used by many data warehouse designers to build their data warehouses. In this design model all the data is stored in two types of tables: fact tables and dimension tables. A fact table contains the facts/measurements of the business, and a dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.
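A minimal sketch of the two table types (the table and column names are assumed for illustration only):

-- Dimension table: descriptive context for the measurements.
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,      -- surrogate key
    product_name  VARCHAR(100),
    category      VARCHAR(50)
);

-- Fact table: measurements of the business process, keyed by its dimensions.
CREATE TABLE fact_sales (
    date_key      INTEGER NOT NULL,
    product_key   INTEGER NOT NULL REFERENCES dim_product (product_key),
    sales_amount  DECIMAL(12,2),            -- measure
    quantity_sold INTEGER,                  -- measure
    PRIMARY KEY (date_key, product_key)
);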
Q. Why is Data Modeling Important?
A.
Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.
The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users.
The data model is also detailed enough to be used by the database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.
Q. What is data mining?
A.
Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.
Q. What is ETL?
A. ETL stands for extraction, transformation and loading.
ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.
· Extraction
Take data from an external source and move it to the warehouse pre-processor database.
· Transformation
The transform data task allows point-to-point generating, modifying and transforming of data.
· Loading
The load data task adds records to a database table in the warehouse.
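Although ETL is normally done with a dedicated tool, the idea can be sketched in plain SQL (the staging and target tables, and the TO_CHAR-based date key, are assumptions for illustration): extract into a staging area, then transform and load into the warehouse table.

-- Extraction: copy raw rows from the source system into a staging table.
INSERT INTO stg_orders (order_id, order_date, amount, currency)
SELECT order_id, order_date, amount, currency
FROM   src_orders;

-- Transformation + Loading: standardize and add the rows to the warehouse fact table.
INSERT INTO fact_orders (order_id, date_key, amount_usd)
SELECT s.order_id,
       CAST(TO_CHAR(s.order_date, 'YYYYMMDD') AS INTEGER) AS date_key,  -- derive the date key
       s.amount * r.usd_rate                              AS amount_usd -- convert currency
FROM   stg_orders s
JOIN   ref_exchange_rate r ON r.currency = s.currency;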
Q. What does the level of granularity of a fact table signify?
A. Granularity
The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps:
Determine which dimensions will be included.
Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.
Q. What is the Difference between OLTP and OLAP?
A. The main differences between OLTP and OLAP are:
1. User and System Orientation
OLTP: customer-oriented, used for transaction processing and querying by clerks, clients and IT professionals.
OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. Data Contents
OLTP: manages current data, very detail-oriented.
OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.
3. Database Design
OLTP: adopts an entity-relationship (ER) model and an application-oriented database design.
OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.
4. View
OLTP: focuses on the current data within an enterprise or department.
OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.
Q. What is SCD1, SCD2, SCD3?
A. SCD stands for Slowly Changing Dimensions.
SCD1: only the updated (current) value is maintained.
Ex: when a customer's address is modified, we update the existing record with the new address.
SCD2: historical and current information is maintained by using
A) Effective dates
B) Versions
C) Flags
or a combination of these.
SCD3: historical and current information is maintained by adding new columns to the target table.
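A hedged SCD Type 2 sketch (table names, column names and key values are assumed): expire the current row and insert a new version when a tracked attribute changes.

-- 1) Close out the current version of the customer's row.
UPDATE dim_customer
SET    end_date   = CURRENT_DATE,
       is_current = 'N'
WHERE  customer_id = 123
AND    is_current  = 'Y';

-- 2) Insert the new version with its own surrogate key and effective date.
INSERT INTO dim_customer
    (customer_sk, customer_id, address, start_date, end_date, is_current)
VALUES
    (98765, 123, 'New Address', CURRENT_DATE, NULL, 'Y');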
Q. Why are OLTP database designs not generally a good idea for a Data Warehouse?
A.
In OLTP, tables are normalised, so query response for end users will be slow; also, an OLTP system does not contain years of history, so it cannot be analysed over the long term.
Q. What is a BUS Schema?
A. A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.
Q. What are the various Reporting tools in the Market?
A. 1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
Q. What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?
A.
1. Normalization is a process for assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.
2. Normalization is the analysis of functional dependency between attributes / data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.
1NF: Repeating groups must be eliminated, dependencies can be identified, and all key attributes are defined; there are no repeating groups in the table.
2NF: The table is already in 1NF and includes no partial dependencies (no attribute is dependent on only a portion of the primary key). It may still exhibit transitive dependency, i.e. attributes may be functionally dependent on non-key attributes.
3NF: The table is already in 2NF and contains no transitive dependencies.
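A small illustration (assumed schema): a customer table where city_name depends on city_id, a non-key attribute, exhibits a transitive dependency; splitting the city out brings the design to 3NF.

-- Before (not 3NF): city_name depends on city_id rather than on the key customer_id.
--   customer(customer_id PK, customer_name, city_id, city_name)

-- After: the transitive dependency is removed by splitting out the city.
CREATE TABLE city (
    city_id   INTEGER PRIMARY KEY,
    city_name VARCHAR(100)
);

CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    city_id       INTEGER REFERENCES city (city_id)
);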
Q. What is a Fact table?
A.
A fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process, such as "monthly sales number", is captured in the fact table. The fact table also contains the foreign keys to the dimension tables.
Q. What are conformed dimensions?
A1.
Conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. Ex: a Date dimension is connected to all facts, such as Sales facts, Inventory facts, etc.
A2.
Conformed dimensions are dimensions which are common to multiple cubes (a cube being a schema containing fact and dimension tables).
Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as the facts and dimensions; here D1 and D2 are the conformed dimensions.
Q. What are the Different methods of loading Dimension tables?
A. Conventional Load:
Before loading the data, all the table constraints are checked against the data.
Direct Load (faster loading):
All the constraints are disabled and the data is loaded directly. Later, the data is checked against the table constraints, and the bad data is not indexed.
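The answer above describes loader-style conventional vs. direct-path loading; as a related, Oracle-flavoured SQL illustration (an assumption, since no database is named), a direct-path insert can be requested with the APPEND hint:

-- Conventional insert: rows go through the normal buffer-cache path.
INSERT INTO dim_customer (customer_key, customer_name)
SELECT customer_key, customer_name FROM stg_customer;

-- Direct-path insert (Oracle): the APPEND hint writes data blocks directly
-- above the table's high-water mark, which is typically faster for bulk loads.
INSERT /*+ APPEND */ INTO dim_customer (customer_key, customer_name)
SELECT customer_key, customer_name FROM stg_customer;
COMMIT;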
Q. What is a conformed fact?
A.
A conformed fact is a fact (measure) that has the same definition and calculation in every fact table in which it appears, so it can be compared and combined across multiple data marts. (Conformed dimensions, similarly, are dimensions that can be used across multiple data marts in combination with multiple fact tables.)
Q. What are Data Marts?
A. Data marts are designed to help managers make strategic decisions about their business.
Data marts are subsets of the corporate-wide data that are of value to a specific group of users.
There are two types of data marts:
1. Independent data marts – sourced from data captured from OLTP systems, from external providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts – sourced directly from enterprise data warehouses.
Q. What is the level of granularity of a fact table?
A.
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. Level of granularity then means how much detail you are willing to record for each transactional fact: product sales with respect to each minute, or aggregated up to the minute and stored at that level.
Q. How are the Dimension tables designed?
A.
Most dimension tables are designed using normalization principles up to 2NF. In some instances they are further normalized to 3NF. The general steps are:
Find where the data for this dimension is located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension (see more on this in the next section).
Q. What are non-additive facts?
A. Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Q. What type of Indexing mechanism do we need to use for a typical datawarehouse?
A.
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap indexes and/or other types of clustered/non-clustered, unique/non-unique indexes.
To my knowledge, SQL Server does not support bitmap indexes; among the major databases, Oracle is the one that does.
Q. What is a Snowflake Schema?
A.
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.
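A snowflaked product dimension sketch (names are assumed): the category outrigger joins only to the primary product dimension, and only the product dimension joins to the fact table.

-- Outrigger table: joins to the primary dimension, not to the fact table.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

-- Primary dimension table: the only product table that joins to the fact table.
CREATE TABLE dim_product_snow (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_key  INTEGER REFERENCES dim_category (category_key)
);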
Q. What is real time data-warehousing?
A.
Real-time data warehousing is a combination of two things: 1) real-time
activity and 2) data warehousing. Real-time activity is activity that
is happening right now. The activity could be anything such as the sale
of widgets. Once the activity is complete, there is data about it.
Data
warehousing captures business activity data. Real-time data warehousing
captures business activity data as it occurs. As soon as the business
activity is complete and there is data about it, the completed activity
data flows into the data warehouse and becomes available instantly. In
other words, real-time data warehousing is a framework for deriving
information from data as the data becomes available.
Q. What are slowly changing dimensions?
A. SCD stands for Slowly Changing Dimensions. Slowly changing dimensions are of three types.
SCD1: only the updated (current) value is maintained.
Ex: when a customer's address is modified, we update the existing record with the new address.
SCD2: historical and current information is maintained by using
A) Effective dates
B) Versions
C) Flags
or a combination of these.
SCD3: historical and current information is maintained by adding new columns to the target table.
Q. What are Semi-additive and factless facts and in which scenario will you use such kinds of fact tables?
A. Snapshot facts are semi-additive; when we maintain aggregated facts such as balances, we go for semi-additive facts.
Ex: average daily balance.
A fact table without numeric fact columns is called a factless fact table.
Ex: promotion facts.
Such a table records the promotion events for transactions (e.g. product samples) even though it does not contain any measures.
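A factless fact table sketch (assumed names): it records only that a promotion event occurred, with keys and no numeric measures, so analysis is done by counting rows.

CREATE TABLE fact_promotion_event (
    date_key      INTEGER NOT NULL,
    product_key   INTEGER NOT NULL,
    store_key     INTEGER NOT NULL,
    promotion_key INTEGER NOT NULL,
    PRIMARY KEY (date_key, product_key, store_key, promotion_key)  -- no measure columns
);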
Q. Differences between star and snowflake schemas?
A. Star schema - all dimensions are linked directly to the fact table.
Snowflake schema - dimensions may be interlinked or may have one-to-many relationships with other tables.
Q. What is a Star Schema?
A.
A star schema is a way of organising tables so that results can be retrieved from the database easily and quickly in a warehouse environment. Usually a star schema consists of one fact table surrounded by one or more dimension tables; because the layout looks like a star, it got its name.
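A typical star-join query sketch against such a schema (table and column names are assumed): the central fact table is joined to each dimension, then filtered and grouped by dimension attributes.

SELECT d.year_num,
       p.category,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_date    d ON d.date_key    = f.date_key
JOIN   dim_product p ON p.product_key = f.product_key
WHERE  d.year_num = 2012
GROUP BY d.year_num, p.category;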
Q. What is a general purpose scheduling tool?
A.
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.
Q. What is ER Diagram?
A.
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views.
Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
Since Chen wrote his paper the model has been extended, and today it is commonly used for database design. For the database designer, the utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables.
It is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.
Q. Which columns go to the fact table and which columns go the dimension table?
A. The numeric measures of the business process go to the fact table, and the descriptive (textual) attributes of the entities go to the dimension tables.
The primary key columns of the dimension tables go to the fact table as foreign keys.
Q. What are modeling tools available in the Market?
A. Here are a number of data modeling tools (with vendor):
Erwin (Computer Associates)
Embarcadero (Embarcadero Technologies)
Rational Rose (IBM Corporation)
PowerDesigner (Sybase Corporation)
Oracle Designer (Oracle Corporation)
Q. Name some of modeling tools available in the Market?
A. These tools are used for data/dimension modeling:
1. Oracle Designer
2. ERwin (Entity Relationship for Windows)
3. Informatica (Cubes/Dimensions)
4. Embarcadero
5. PowerDesigner (Sybase)
Q. How do you load the time dimension?
A.
Time dimensions are usually loaded by a program that loops through all
possible dates that may appear in the data. It is not unusual for 100
years to be represented in a time dimension, with one row per day.
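A sketch of such a load using a recursive common table expression (exact syntax varies by database, and the dim_date column names are assumed; some databases omit the RECURSIVE keyword):

-- Generate one row per calendar day for 2012 and load it into the time dimension.
WITH RECURSIVE dates (d) AS (
    SELECT DATE '2012-01-01'
    UNION ALL
    SELECT CAST(d + INTERVAL '1' DAY AS DATE) FROM dates WHERE d < DATE '2012-12-31'
)
INSERT INTO dim_date (date_key, calendar_date, year_num, month_num, day_num)
SELECT EXTRACT(YEAR FROM d) * 10000 + EXTRACT(MONTH FROM d) * 100 + EXTRACT(DAY FROM d),
       d,
       EXTRACT(YEAR FROM d),
       EXTRACT(MONTH FROM d),
       EXTRACT(DAY FROM d)
FROM   dates;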
Q. Explain the advantages of RAID 1, RAID 1/0, and RAID 5. What type of RAID setup would you put your TX logs on?
A.
Transaction logs write sequentially and don't need to be read at all.
The ideal is to have each on RAID 1/0 because it has much better write
performance than RAID 5.
RAID
1 is also better for TX logs and costs less than 1/0 to implement. It
has a tad less reliability and performance is a little worse generally
speaking.
RAID 5 is best for data generally because of cost and the fact it provides great read capability.
Q. What are the various ETL tools in the Market?
A. Various ETL tools used in the market are:
1. Informatica
2. DataStage
3. MS SQL Server DTS (Integration Services in 2005)
4. Ab Initio
5. SQL*Loader
6. Sunopsis
7. Oracle Warehouse Builder
8. Data Junction
Q. What is VLDB?
A1. VLDB stands for Very Large Database.
It is an environment or storage space, managed by a relational database management system (RDBMS), consisting of vast quantities of information.
A2.
VLDB doesn't refer simply to the size of the database or the vast amount of information stored. It refers to the window of opportunity to take a backup of the database.
The window of opportunity refers to the time interval available for backups; if the DBA is unable to take a backup within the specified time, then the database is considered a VLDB.
Q. What are Data Marts?
A.
A data mart is a focused subset of a data warehouse that deals with a single area of data (such as a department) and is organized for quick analysis.
Q. What are the steps to build a data warehouse?
A. Gathering business requirements
Identifying sources
Identifying facts
Defining dimensions
Defining attributes
Redefining dimensions and attributes
Organising attribute hierarchies and defining relationships
Assigning unique identifiers
Additional conventions: cardinality / adding ratios
Q. What is the difference between E-R Modeling and Dimensional Modeling?
A. The basic difference is that E-R modeling has both a logical and a physical model, while a dimensional model has only a physical model.
E-R modeling is used for normalizing the OLTP database design.
Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.
Q. Why is the fact table in normal form?
A. Basically the fact table consists of the index keys of the dimension/lookup tables and the measures.
Whenever a table contains only keys and measures, that itself implies that the table is in normal form.
Q. What are the advantages of data mining over traditional approaches?
A.
Data mining is used for estimating the future. For example, if we take a company or business organization, by using data mining we can predict the future of the business in terms of revenue, employees, customers, orders, etc.
Traditional approaches use simple algorithms for estimating the future, but they do not give accurate results when compared to data mining.
Q. What are the various ETL tools in the Market?
A. Various ETL tools used in the market are:
Informatica
DataStage
Oracle Warehouse Builder
Ab Initio
Data Junction
Q. What is a CUBE in datawarehousing concept?
A.
Cubes are logical representations of multidimensional data. The edges of the cube contain dimension members and the body of the cube contains data values.
Q. What are data validation strategies for data mart validation after the loading process?
A. Data validation is making sure that the loaded data is accurate and meets the business requirements.
Strategies are the different methods followed to meet the validation requirements.
Q. What is the datatype of the surrogate key?
A. The datatype of the surrogate key is either integer or numeric (number).
Q. What is degenerate dimension table?
A.
A degenerate dimension is a value stored in the fact table which is neither a dimension key nor a measure, and which has no dimension table of its own. Ex: invoice id, employee number.
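A sketch (assumed names): the invoice number sits directly on the fact table as a degenerate dimension, with no separate dimension table to join to.

CREATE TABLE fact_invoice_line (
    date_key       INTEGER NOT NULL,
    product_key    INTEGER NOT NULL,
    customer_key   INTEGER NOT NULL,
    invoice_number VARCHAR(20) NOT NULL,   -- degenerate dimension: no dimension table behind it
    line_amount    DECIMAL(12,2)           -- measure
);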
Q. What is Dimensional Modelling?
A.
Dimensional Modelling is a design concept used by many data warehouse designers to build their data warehouses. In this design model all the data is stored in two types of tables: fact tables and dimension tables. A fact table contains the facts/measurements of the business, and a dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.
Q. What are the methodologies of Data Warehousing.?
A.
Every company has a methodology of its own, but to name a few, the SDLC methodology and the AIM methodology are commonly used. Other methodologies are AMM, the World Class methodology and many more.
Q. What is a linked cube?
A.
A linked cube is one in which a subset of the data can be analysed in great detail. The linking ensures that the data in the cubes remains consistent.
Q. What is the main difference between Inmon and Kimball philosophies of data warehousing?
A. Both differ in their concept of building the data warehouse.
According to Kimball:
Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is built from the data marts through conformed dimensions. Hence a unified view of the enterprise can be obtained from dimensional modeling at a local departmental level.
According to Inmon:
Inmon believes in creating a data warehouse on a subject-by-subject-area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary.
In short:
Kimball: first data marts, then combined into a data warehouse.
Inmon: first the data warehouse, then data marts.
Q. What is a Data Warehousing Hierarchy?
A. Hierarchies
Hierarchies
are logical structures that use ordered levels as a means of organizing
data. A hierarchy can be used to define data aggregation. For example,
in a time dimension, a hierarchy might aggregate data from the month
level to the quarter level to the year level. A hierarchy can also be
used to define a navigational drill path and to establish a family
structure.
Within
a hierarchy, each level is logically connected to the levels above and
below it. Data values at lower levels aggregate into the data values at
higher levels. A dimension can be composed of more than one hierarchy.
For example, in the product dimension, there might be two
hierarchies--one for product categories and one for product suppliers.
Dimension
hierarchies also group levels from general to granular. Query tools use
hierarchies to enable you to drill down into your data to view
different levels of granularity. This is one of the key benefits of a
data warehouse.
When designing hierarchies, you must consider the relationships in business structures: for example, a divisional multilevel sales organization.
Hierarchies
impose a family structure on dimension values. For a particular level
value, a value at the next higher level is its parent, and values at the
next lower level are its children. These familial relationships enable
analysts to access data quickly.
Levels
A
level represents a position in a hierarchy. For example, a time
dimension might have a hierarchy that represents data at the month,
quarter, and year levels. Levels range from general to specific, with
the root level as the highest or most general level. The levels in a
dimension are organized into one or more hierarchies.
Level Relationships
Level
relationships specify top-to-bottom ordering of levels from most
general (the root) to most specific information. They define the
parent-child relationship between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex query rewrites. For example, the database can aggregate existing sales revenue on a quarterly basis to a yearly aggregation when the dimensional dependencies between quarter and year are known.
Q. What is the main difference between a schema in an RDBMS and schemas in a Data Warehouse?
A. RDBMS (OLTP) schema:
* Used for OLTP systems
* Traditional, application-oriented schema
* Normalized
* Harder to understand and navigate for analysis
* Not well suited to complex analytical queries
* Modelled for transaction processing
DWH schema:
* Used for OLAP systems
* Newer-generation, subject-oriented schema
* De-normalized
* Easy to understand and navigate
* Complex analytical queries can be solved easily
* Modelled for analysis and reporting
Q. What is hybrid slowly changing dimension?
A. Hybrid SCDs are a combination of both SCD 1 and SCD 2.
It may happen that in a table some columns are important and we need to track changes for them, i.e. capture the historical data, whereas for other columns we don't care even if the data changes.
For such tables we implement hybrid SCDs, wherein some columns are Type 1 and some are Type 2.
Q. What are the different architectures of a data warehouse?
A. There are two main approaches:
1. Top-down (Bill Inmon)
2. Bottom-up (Ralph Kimball)
Q. What is incremental loading?
Q. What is batch processing?
Q. What is a cross-reference table?
Q. What is an aggregate fact table?
A. Incremental loading means loading only the ongoing changes from the OLTP system rather than the full data set (see the sketch below).
Batch processing means running loads or transformations on groups of records at scheduled intervals rather than record by record in real time.
A cross-reference table maps keys or codes from one system to the corresponding keys or codes in another.
An aggregate fact table contains the measure values aggregated/grouped/summed up to some level of a hierarchy.
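An incremental-load sketch for the first answer above (the staging/control tables and the last_modified column are assumptions): only rows changed since the previous successful load are picked up from the source.

-- Pull only the rows changed since the last recorded load for this source table.
INSERT INTO stg_orders (order_id, order_date, amount, last_modified)
SELECT order_id, order_date, amount, last_modified
FROM   src_orders
WHERE  last_modified > (SELECT MAX(load_timestamp)
                        FROM   etl_load_control
                        WHERE  table_name = 'SRC_ORDERS');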
Q. What is junk dimension? what is the difference between junk dimension and degenerated dimension?
A. Junk dimension: grouping of random flags and text attributes in a dimension and moving them to a separate sub-dimension.
Degenerate dimension: keeping control information on the fact table. Ex: consider a dimension table with fields like order number and order line number that has a 1:1 relationship with the fact table; in this case the dimension table is removed and the order information is stored directly in the fact table, in order to eliminate unnecessary joins while retrieving order information.
Q. What are the possible data marts in Retail sales?
A. Product information and sales information.
Q. What is the definition of normalized and denormalized view and what are the differences between them?
A. Normalization is the process of removing redundancies.
Denormalization is the process of allowing redundancies.
Q. What is meant by metadata in context of a Datawarehouse and how it is important?
A.
Metadata is data about data. A business analyst or data modeler usually captures information about data - the source (where and how the data originated), the nature of the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the data (how it is modified/derived and its life cycle) - in a data dictionary, a.k.a. metadata. Metadata is also present at the data mart level: subsets, facts and dimensions, ODS, etc. For a DW user, metadata provides vital information for analysis / DSS.
Q. Differences between star and snowflake schemas?
A. Star schema:
A single fact table with N dimension tables joined to it.
Snowflake schema:
Any dimension with extended (normalized) dimension tables is known as a snowflake schema.
Q. Difference between Snow flake and Star Schema. What are situations where Snow flake Schema is better than Star Schema to use and when the opposite is true?
A. A star schema contains the dimension tables mapped around one or more fact tables.
It is a de-normalised model.
There is no need to use complicated joins.
Queries return results quickly.
A snowflake schema is the normalised form of a star schema.
It contains deeper joins, because the tables are split into many pieces; we can easily make modifications directly in the tables.
We have to use complicated joins, since we have more tables.
There will be some delay in processing the query.
Q. What is VLDB?
A.
The perception of what constitutes a VLDB continues to grow. A one
terabyte database would normally be considered to be a VLDB.
Q. What are the data types present in BO (Business Objects)? What happens if we implement a view in the Designer and in a report?
A. There are three data types: Dimension, Measure and Detail.
A view is nothing but an alias, and it can be used to resolve loops in the universe.
Q. Can a dimension table contain numeric values?
A. Yes, but such columns are treated as descriptive (character-like) attributes rather than measures; the values themselves can be numeric or character.
Q. What is the difference between view and materialized view?
A.
View: stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.
Materialized view: stores the result of the SQL query in table form in the database. The SQL statement executes only once (and again on each refresh); after that, every time you run the query the stored result set is used. Pros include quick query results.
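For illustration (Oracle-style syntax is assumed for the materialized view; object names are made up):

-- Ordinary view: only the query definition is stored; it runs on every access.
CREATE VIEW v_sales_by_month AS
SELECT d.year_num, d.month_num, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date d ON d.date_key = f.date_key
GROUP BY d.year_num, d.month_num;

-- Materialized view: the result set is stored and refreshed on demand.
CREATE MATERIALIZED VIEW mv_sales_by_month
REFRESH COMPLETE ON DEMAND
AS
SELECT d.year_num, d.month_num, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date d ON d.date_key = f.date_key
GROUP BY d.year_num, d.month_num;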
Q. What is a surrogate key? Where do we use it? Explain with examples.
A1. A surrogate key is a substitute for the natural primary key.
It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key.
It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.
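A sketch of generating surrogate keys in the database itself (syntax varies by product; the table and column names, including AIRPORT_ID and AIRPORT_NAME from the example above, are illustrative):

-- Surrogate key via an identity column (ANSI SQL:2003 style).
CREATE TABLE dim_airport (
    airport_id   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    airport_name VARCHAR(100),    -- natural/business attribute shown to users
    city_name    VARCHAR(100)
);

-- Alternative approach (instead of the identity column): an explicit sequence,
-- as provided by Oracle, PostgreSQL, SQL Server and others.
-- CREATE SEQUENCE seq_airport_id;
-- INSERT INTO dim_airport (airport_id, airport_name, city_name)
-- VALUES (NEXT VALUE FOR seq_airport_id, 'Some Airport', 'Some City');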
A2. Adapted from response by Vincent on Thursday, March 13, 2003
Another benefit you can get from surrogate keys (SID) is :
Tracking the SCD - Slowly Changing Dimension.
Let me give you a simple, classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee dimension). This employee has a turnover allocated to him on Business Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the old turnover should still belong to Business Unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'.
If you use surrogate keys, you could create, on the 2nd of June, a new record for the Employee 'E1' in your Employee dimension with a new surrogate key.
This way, in your fact table, your old data (before the 2nd of June) carries the SID of the Employee 'E1' + 'BU1', and all new data (after the 2nd of June) takes the SID of the Employee 'E1' + 'BU2'.
You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit, i.e. 'E1' + 'BU1' or 'E1' + 'BU2'. The difference with the natural-key enlargement process is that you might not have every part of your new key within your fact table, so you might not be able to do the join on the new enlarged key -> so you need another id.
Q. What is an aggregate table and an aggregate fact table? Any examples of both?
A. An aggregate table contains summarised data; materialized views are one way to implement aggregate tables.
For example, in sales we may have only date-level transactions. If we want to create a report like sales by product per year, we aggregate the date values into week, month, quarter and year aggregate tables (e.g. week_agg, month_agg, quarter_agg, year_agg). To retrieve data from these tables, the reporting tool's aggregate-aware function (such as @Aggregate_Aware in Business Objects) is used.
Q. What is active data warehousing?
A.
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day-to-day, even minute-to-minute, decision making in a way that increases the quality of those customer touches, which encourages customer loyalty and thus secures an organization's bottom line. The marketplace is coming of age as we progress from first-generation "passive" decision-support systems to current- and next-generation "active" data warehouse implementations.
Q. Why do we override the execute method in Struts? Please give the details.
A.
As part of the Struts framework we can develop Action servlet classes and ActionForm classes (here an Action servlet means a class that extends the Action class, and an ActionForm means a class that extends the ActionForm class) as well as other servlet classes.
In the ActionForm class we can implement validate(); this method returns an ActionErrors object and is where the validation code is written. If this method returns null or an ActionErrors object of size 0, the web container calls execute() on the Action class. If it returns size > 0, execute() is not called; instead the JSP, servlet or HTML file given as the value of the input attribute in struts-config.xml is executed.
Q. What is the difference between Datawarehousing and BusinessIntelligence?
A.
Data warehousing deals with all aspects of managing the development,
implementation and operation of a data warehouse or data mart including
meta data management, data acquisition, data cleansing, data
transformation, storage management, data distribution, data archiving,
operational reporting, analytical reporting, security management,
backup/recovery planning, etc. Business intelligence, on the other hand,
is a set of software tools that enable an organization to analyze
measurable aspects of their business such as sales performance,
profitability, operational efficiency, effectiveness of marketing
campaigns, market penetration among certain customer groups, cost
trends, anomalies and exceptions, etc. Typically, the term "business intelligence" is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business including the back office. The business needs the back office on which to function, but the back office without a business to support makes no sense.
Q. What is the difference between OLAP and datawarehosue?
A. A data warehouse is the place where the data is stored for analysis,
whereas OLAP is the process of analyzing the data: managing aggregations and
partitioning information into cubes for in-depth visualization.
Q. What is a factless fact table? Where have you used it in your project?
A. A factless fact table contains only keys; there are no measures available in the fact table.
Q. Why Denormalization is promoted in Universe Designing?
A.
In a relational data model, for normalization purposes, some lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table, called a dimension table, for performance and for slicing data. Because of this merging of tables into one large dimension table, complex intermediate joins are avoided; dimension tables are directly joined to fact tables. Although redundancy occurs in the dimension table, its size is only about 15% of the fact table's. That is why denormalization is promoted in universe designing.
Q. What is the difference between ODS and OLTP?
A. ODS: a collection of tables created in the data warehouse that maintains only current data,
whereas OLTP maintains the data only for transactions; OLTP systems are designed for recording the daily operations and transactions of a business.
Q. What is the difference between datawarehouse and BI?
A.
Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of the business. A BI tool analyzes the data of a data warehouse and helps reach business decisions based on the results of the analysis.
Q. Are OLAP databases called decision support systems? True/false?
A. True.
Q. Explain in detail type 1, type 2 and type 3 SCDs.
A. Type 1:
Most recent value only.
Type 2 (full history):
i) Version number
ii) Flag
iii) Date
Type 3:
Current value and one previous value.
Q. What is snapshot?
A.
You can disconnect the report from the catalog to which it is attached
by saving the report with a snapshot of the data. However, you must
reconnect to the catalog if you want to refresh the data.
Q. What are non-additive facts in detail?
A. A fact may be a measure, a metric or a dollar value. Additive facts, such as dollar amounts, can be summed across all dimensions: if we want to find the amount for a particular place over a particular period of time, we can add the dollar amounts and come up with the total. Non-additive facts are those that cannot meaningfully be summed across any dimension.
For example, a measure such as height for 'citizens by geographical location': when we roll up 'city' data to 'state' level we should not add the heights of the citizens; rather we may want to use the measure to derive a count or an average.
Q. What is a data warehouse?
A. A data warehouse is a collection of data marts representing historical data from different operations in the company. This data is stored in a structure optimized for querying and data analysis. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a database for historical data from different functions within a company.
Q. What is a data mart?
A. A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller than the corporate data warehouse.
Q. What are the benefits of data warehousing?
A. Data warehouses are designed to perform well with aggregate queries running on large amounts of data.
The
structure of data warehouses is easier for end users to navigate,
understand and query against unlike the relational databases primarily
designed to handle lots of transactions.
Data
warehouses enable queries that cut across different segments of a
company's operation. E.g. production data could be compared against
inventory data even if they were originally stored in different
databases with different structures.
Queries
that would be complex in very normalized databases could be easier to
build and maintain in data warehouses, decreasing the workload on
transaction systems.
Data
warehousing is an efficient way to manage and report on data that is
from a variety of sources, non uniform and scattered throughout a
company.
Data warehousing is an efficient way to manage demand for lots of information from lots of users.
Data
warehousing provides the capability to analyze large amounts of
historical data for nuggets of wisdom that can provide an organization
with competitive advantage.
Q. What is OLAP?
A. OLAP stands for Online Analytical Processing.
It
uses database tables (fact and dimension tables) to enable
multidimensional viewing, analysis and querying of large amounts of
data. E.g. OLAP technology could provide management with fast answers to
complex queries on their operational data or enable them to analyze
their company's historical data for trends and patterns.
Q. What is OLTP?
A. OLTP stands for Online Transaction Processing.
OLTP uses normalized tables to quickly record large amounts of transactions while making sure that these updates of data occur in as few places as possible. Consequently OLTP databases are designed for recording the daily operations and transactions of a business. E.g. a timecard system that supports a large production environment must successfully record a large number of updates during critical periods like lunch hour, breaks, startup and close of work.
Q. What are dimensions?
A.
Dimensions are categories by which summarized data can be viewed. E.g. a
profit summary in a fact table can be viewed by a Time dimension
(profit by month, quarter, year), Region dimension (profit by country,
state, city), Product dimension (profit for product1, product2).
Q. What are fact tables?
A.
A fact table is a table that contains summarized numerical and
historical data (facts) and a multipart index composed of foreign keys
from the primary keys of related dimension tables.
Q. What are measures?
A.
Measures are numeric data based on columns in a fact table. They are
the primary data which end users are interested in. E.g. a sales fact
table may contain a profit measure which represents profit on each sale.
Q. What are aggregations?
A.
Aggregations are precalculated numeric data. By calculating and storing
the answers to a query before users ask for it, the query processing
time can be reduced. This is key in providing fast query performance in
OLAP.
Q. What are cubes?
A. Cubes are data processing
units composed of fact tables and dimensions from the data warehouse.
They provide multidimensional views of data, querying and analytical
capabilities to clients.
Q. What is the PivotTable® Service?
A.
This is the primary component that connects clients to the Microsoft® SQL Server™ 2000 Analysis Server. It also provides the capability for clients to create local offline cubes, using it as an OLAP server. The PivotTable® Service does not have a user interface; the clients using its services have to provide their own user interface.
Q. What are offline OLAP cubes?
A.
These are OLAP cubes created by clients, end users or third-party
applications accessing a data warehouse, relational database or OLAP
cube through the Microsoft® PivotTable® Service. E.g. Microsoft® Excel™
is very popular as a client for creating offline local OLAP cubes from
relational databases for multidimensional analysis. These cubes have to
be maintained and managed by the end users who have to manually refresh
their data.
Q. What are virtual cubes?
A.
These are combinations of one or more real cubes and require no disk
space to store them. They store only the definitions and not the data of
the referenced source cubes. They are similar to views in relational
databases.
Q. What are MOLAP cubes?
A.
MOLAP Cubes: stands for Multidimensional OLAP. In MOLAP cubes the
data aggregations and a copy of the fact data are stored in a
multidimensional structure on the Analysis Server computer. It is best
when extra storage space is available on the Analysis Server computer
and the best query performance is desired. MOLAP local cubes contain all
the necessary data for calculating aggregates and can be used offline.
MOLAP cubes provide the fastest query response time and performance but
require additional storage space for the extra copy of data from the
fact table.
Q. What are ROLAP cubes?
A.
ROLAP Cubes: stands for Relational OLAP. In ROLAP cubes a copy of
data from the fact table is not made and the data aggregates are stored
in tables in the source relational database. A ROLAP cube is best when
there is limited space on the Analysis Server and query performance is
not very important. ROLAP local cubes contain the dimensions and cube definitions, but aggregates are calculated when they are needed. ROLAP cubes require less storage space than MOLAP and HOLAP cubes.
Q. What are HOLAP cubes?
A.
HOLAP Cubes: stands for Hybrid OLAP. A HOLAP cube has a combination of the ROLAP and MOLAP cube characteristics. It does not create a copy of the source data; however, data aggregations are stored in a multidimensional structure on the Analysis Server computer. HOLAP cubes are best when storage space is limited but faster query responses are needed.
Q. What is the approximate size of a data warehouse?
A.
You can estimate the approximate size of a data warehouse made up of
only fact and dimension tables by estimating the approximate size of the
fact tables and ignoring the sizes of the dimension tables.
To
estimate the size of the fact table in bytes, multiply the size of a
row by the number of rows in the fact table. A more exact estimate would
include the data types, indexes, page sizes, etc. An estimate of the
number of rows in the fact table is obtained by multiplying the number
of transactions per hour by the number of hours in a typical work day
and then multiplying the result by the number of days in a year and
finally multiply this result by the number of years of transactions
involved. Divide this result by 1024 to convert to kilobytes and by 1024
again to convert to megabytes.
E.g.
A data warehouse will store facts about the help provided by a
company’s product support representatives. The fact table is made of up
of a composite key of 7 indexes (int data type) including the primary
key. The fact table also contains 1 measure of time (datetime data type)
and another measure of duration (int data type). 2000 product incidents
are recorded each hour in a relational database. A typical work day is 8
hours and support is provided for every day in the year. What will be
approximate size of this data warehouse in 5 years?
First calculate the approximate size of a row in bytes (int data type = 4 bytes, datetime data type = 8 bytes):
size
of a row = size of all composite indexes (add the size of all indexes) +
size of all measures (add the size of all measures).
Size of a row (bytes) = (4 * 7) + (8 + 4).
Size of a row (bytes) = 40 bytes.
Number of rows in fact table = (number of transactions per hour) * (8 hours) * (365 days in a year).
Number of rows in fact table = (2000 product incidents per hour) * (8 hours ) * (365 days in a year).
Number of rows in fact table = 2000 * 8 * 365
Number of rows in fact table = 5840000
Size of fact table (1 year) = (Number of rows in fact table) * (Size of a row)
Size of fact table (bytes per year) = 5840000 * 40
Size of fact table (bytes per year) = 233600000.
Size of fact table (megabytes per year) = 233600000 / (1024*1024)
Size of fact table (in megabytes for 5 years) = (233600000 * 5) / (1024 * 1024)
Size of fact table (megabytes) = 1113.89 MB
Size of fact table (gigabytes) = 1113.89 / 1024
Size of fact table (gigabytes) = 1.089 GB