-
I am trying to write code that can count duplicate numbers in an array, for the following sequences:
1 2 1 2 5 2 found 1 times
1 2 2 2 5 2 found 2 times
1 2 3 4 5 not found
9 9 9 9 9 9 found 4 times
I am trying the first sequence:
#include <stdio.h>
int main()
{
    int array[5] = {1, 2, 1, 2, 5};
    for (int i = 0; i < 5; i++)
    {
        for (int j = i + 1; j < 5; j++)
        {
            if (array[i] == array[j])
            {
            }
        }
    }
    return 0;
}
How do I count duplicate numbers in an array?
-
How about this.
You don't need to compare anything; just make another array with counts.
At the end of the loop you'll have the counts of each number.
You can check whether a counter is < 2 and omit the output for that value, as it has no duplicates.
#include <stdint.h>
#include <stdio.h>

uint8_t array[] = {1, 2, 1, 2, 5, 2};  /* example data: values 0-9 only */
uint8_t count[10] = {0};

int main(void){
    /* sizeof(array) equals the element count here only because
       uint8_t elements are exactly 1 byte each */
    for (uint8_t i=0; i<sizeof(array); i++){
        count[ array[i] ]++;
    }
    for (uint8_t i=0; i<10; i++){
        printf("%u found %u times\n", i, count[i]);
    }
    return 0;
}
Change uint8_t to uint16_t / uint32_t if your array has more than 255 / 65535 elements (the counters can overflow too, not just the loop index).
Also, make sure your array doesn't contain anything other than 0-9, or count[] will be indexed out of bounds.
-
For small arrays, two nested loops through the elements may be acceptable. For large or huge arrays, either sort the array first and then find the counts with a single pass through the sorted elements, or set up a number -> count map. This map could be realized e.g. as a hash table, or, if the set of possible numbers is a dense range (e.g. 1..9 as in the example), as yet another count[] array indexed by the number.
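For example, here is a minimal sketch of the sort-then-scan variant, using qsort() from <stdlib.h>. The example data and the "occurrences minus one" reading of the expected output are my assumptions, not something the OP confirmed:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids the overflow risk of x - y */
}

int main(void)
{
    int array[] = {1, 2, 1, 2, 5, 2};   /* OP's first sequence */
    size_t n = sizeof(array) / sizeof(array[0]);

    qsort(array, n, sizeof(array[0]), cmp_int);

    /* After sorting, equal values are adjacent: count run lengths. */
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && array[i + run] == array[i])
            run++;
        if (run > 1)
            printf("%d found %zu times\n", array[i], run - 1);
        i += run;
    }
    return 0;
}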
-
If you only have numbers between 0 and 9, the easiest approach is a 10-element array in which you simply count each occurrence. A single for loop is enough.
If you have a limited number of possible values (e.g. <= 32 or <= 64), you could use a 32-bit or 64-bit variable to keep track of which values were detected twice, by setting individual bits on or off. You could then keep the counts in a 32-, 64-, or however-many-element array.
If you're memory constrained, the original array doesn't have to remain intact, and you have lots of numbers, an option is to sort the array; a simple for loop over the sorted array then easily determines how many times each value shows up. You can even keep track of each element's original position by packing it into the value. For example, with an array of 1000 random values, you know that up to 1024 positions can be stored in 10 bits, so you can shift the original value 10 bits to the left and put the array offset in those low 10 bits; during and after the sort, equal values then end up ordered with the smaller offsets closer to the start of the array.
Of course, that's provided your values won't overflow when you shift bits that way.
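A rough sketch of that packing trick (my own illustration; it assumes non-negative values small enough that value << 10 cannot overflow a 32-bit int, i.e. values below 2^21):

#include <stdio.h>
#include <stdlib.h>

#define IDX_BITS 10   /* 10 bits: enough for up to 1024 element positions */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int array[] = {1, 2, 1, 2, 5, 2};
    size_t n = sizeof(array) / sizeof(array[0]);

    /* Pack: value in the high bits, original index in the low bits.
       Equal values then sort by original position automatically. */
    for (size_t i = 0; i < n; i++)
        array[i] = (array[i] << IDX_BITS) | (int)i;

    qsort(array, n, sizeof(array[0]), cmp_int);

    for (size_t i = 0; i < n; i++) {
        int value = array[i] >> IDX_BITS;              /* unpack value */
        int index = array[i] & ((1 << IDX_BITS) - 1);  /* unpack offset */
        printf("value %d was at position %d\n", value, index);
    }
    return 0;
}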
-
I ran the code on my system but it doesn't print anything.
#include <stdio.h>
int main()
{
    int array[5] = {1, 2, 1, 2, 5};
    int count[10] = {0};
    for (int i = 0; i < sizeof(array); i++)
    {
        count[ array[i] ]++;
    }
    for (int i = 0; i < 10; i++)
    {
        printf("%u found %u times\n", i, count[i]);
    }
    return 0;
}
-
sizeof(array) is the size of array in bytes, not the number of elements.
Anyway, the 2nd loop should print 10 lines to stdout.
EDIT: The first loop may crash the program when accessing elements beyond the end of the array. A premature crash could be the reason why you don't see anything.
-
It's exceedingly likely that one of the array elements beyond the end of array[5] has a value other than [0-9] and you're stomping on many random memory locations off the end of the count array.
For now, hard code the length to be 5.
Make sure that works.
Immediately after that, learn the (sizeof(array) / sizeof(array[0])) pattern.
-
When using arrays whose elements are larger than one byte, simply do:
sizeof(array)/sizeof(array[0])
This way you always get the number of elements in the array.
Demo: https://jdoodle.com/ia/Cmy
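Applied to the program above, a corrected sketch might look like this (printing count - 1 only for values that actually repeat, to match the "found N times" convention in the original post; that reading is an assumption):

#include <stdio.h>

int main(void)
{
    int array[5] = {1, 2, 1, 2, 5};
    int count[10] = {0};
    int n = sizeof(array) / sizeof(array[0]);   /* element count, not bytes */

    for (int i = 0; i < n; i++)
        count[array[i]]++;

    for (int i = 0; i < 10; i++)
        if (count[i] > 1)
            printf("%d found %d times\n", i, count[i] - 1);

    return 0;
}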
-
Try this:
#include <stdio.h>

#define countof(a) (sizeof(a)/sizeof(a[0]))

int array[] = {1, 2, 1, 2, 5, 2, 7};
int counts[countof(array)];

int countDups(){
    int iMaxDup = 0;
    int maxDup = 0;
    for (int i = 0; i < countof(array); i++){
        if (!counts[i]){
            counts[i] = 1;                  // count self
            for (int j = i + 1; j < countof(array); j++){
                if (array[i] == array[j]){
                    counts[i]++;            // count duplicate
                    counts[j] = -1;         // flag: don't count again
                    if (counts[i] > maxDup){
                        iMaxDup = i;        // most so far
                        maxDup = counts[i];
                    }
                }
            }
        }
        if (counts[i] == -1){
            counts[i] = 0;                  // 'don't count' flags become zero
        }
    }
    return iMaxDup;
}

void arrPrint(int arr[], int mx){           // print an array
    printf("{");
    for (int i = 0; i < mx - 1; i++){
        printf("%d, ", arr[i]);
    }
    printf("%d}", arr[mx - 1]);
}

int main(int argc, char *argv[]){
    int idx;
    printf("Array: "), arrPrint(array, countof(array)), printf("\n");
    idx = countDups();
    printf("Counts: "), arrPrint(counts, countof(counts)), printf("\n");
    printf("There are %d duplicates of %d (element %d)\n", counts[idx], array[idx], idx);
    return 0;
}
It reports the element with the highest duplicate count. If more than one element shares the same count, it reports the first. It also prints the array and the counts for diagnostic purposes. Duplicate elements are not counted again after the first occurrence.
-
Kittu20: as gf has noted, the array should be sorted and the lengths of runs of identical values should be counted. I am repeating this for a reason. It may not be obvious, but the solutions with an array of counters are also using that approach. That’s counting sort (https://en.wikipedia.org/wiki/Counting_sort) with the final step optimized out.(1) The importance of this information? The algorithm retains all features of a sorting algorithm, including time and space complexities and access patterns.
Also: this is an algorithmic question, I believe, but in general, when posting code, please tell us the language.(2)
(1) A later run-length counting operation would be the inverse of that final step, so they cancel each other.
(2) To people who would like to point out that it is “obvious”: helping on programming fora has taught me that it is not that obvious, and omitting it introduces unnecessary misunderstandings. In particular, newbies may be blissfully unaware of the differences between related languages (like C and C++), similarly named languages (e.g. Java and JavaScript), or somewhat visually similar languages (C fed to a Java compiler, Perl attempts in PHP, …).
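To make the counting-sort connection concrete, here is a sketch of that omitted final step (my illustration; values assumed to be 0..9 as in the example). The counting pass is exactly what the earlier count[] solutions already do; full counting sort merely adds the write-back loop:

#include <stdio.h>

int main(void)
{
    int array[] = {1, 2, 1, 2, 5, 2};
    int count[10] = {0};               /* one counter per value 0..9 */
    size_t n = sizeof(array) / sizeof(array[0]);

    for (size_t i = 0; i < n; i++)     /* the counting pass everyone has */
        count[array[i]]++;

    /* The "final step" of counting sort that the counting solutions omit:
       expanding the counts back out yields the sorted array. */
    size_t out = 0;
    for (int v = 0; v < 10; v++)
        for (int c = 0; c < count[v]; c++)
            array[out++] = v;

    for (size_t i = 0; i < n; i++)
        printf("%d ", array[i]);       /* prints: 1 1 2 2 2 5 */
    printf("\n");
    return 0;
}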
-
If the numbers are all in a relatively small interval, you can define a count array directly indexed by the numbers and do it in O(n).
If the interval is too large, this becomes impractical, and sorting the array first and then counting successive identical numbers will probably be your best bet. If the array is very small, you can use any sorting algorithm without much difference, but if it's larger, use qsort or similar and you'll get something that is O(n log n) typically (but O(n^2) in the worst case).
An alternative, if the numbers can be large but the number of distinct numbers is limited, would be to use the first approach (indexing a count array), but with a hashed version of the numbers instead of the numbers directly.
-
umm, yup; for example, a B+tree with "no collisions" can solve that, and it scales as O(log n) per operation.
It's just much more complex to implement and test.
-
Kittu20: as gf has noted, the array should be sorted and the lengths of runs of identical values should be counted. I am repeating this for a reason. It may not be obvious, but the solutions with an array of counters are also using that approach. That’s counting sort (https://en.wikipedia.org/wiki/Counting_sort) with the final step optimized out.(1) The importance of this information? The algorithm retains all features of a sorting algorithm, including time and space complexities and access patterns.
I disagree with this assertion.
A generic comparison sort algorithm (such as quicksort) has a lower bound of O(n log n) on its time complexity, which can grow quite large.
A counting sort (or radix sort), on the other hand, can have a lower complexity, much less than n log(n). So it can be much faster, but it can only be used in special cases.
Therefore, the advice to "just sort the array first and then count duplicates because it is just as fast as any other way" is not entirely accurate. Sorting the array first is a good strategy for simplicity of implementation and understanding, but it is not necessarily going to be the fastest way for very large arrays with unknown contents.
-
I am trying to write code that can count duplicate numbers in an array, for the following sequences:
1 2 1 2 5 2 found 1 times
1 2 2 2 5 2 found 2 times
1 2 3 4 5 not found
9 9 9 9 9 9 found 4 times
I have read the original post many times and can't see what defines a duplicate. Take, for example, the first and second entries. Why doesn't the second case say "3 times" instead of "2 times"? If one numbers the places L->R as a, b, c, d, e:
Case 1 found 2 duplicated 1 time (i.e., b & d), even though they are separated by a non-2. Of course, 1 is also duplicated, but ignore that for the time being.
Case 2 found 2 duplicated 2 times, which I presume are positions b&c + c&d. Why isn't b&d also a duplicate? In other words, in case 2 the middle 2 is counted as part of 2 duplicates. Why can't the leading and terminating 2's be considered part of duplicates?
The all-9's case counts the middle 9's as part of multiple duplicates, but not the leading and trailing ones.
In other words, what are the rules for a duplicate?
EDIT: I am fine if the definition is occurrences - 1.
EDIT2: @mariush I didn't see your comment before my first edit. I was thinking in terms of pairs.
-
I read it as "first time a value is encountered, look forward and see if that value shows up and how many times it shows up"
-
If you want that slightly perverse form of "count duplicates", simply remove the line
counts[i]=1; // count self
from the countDups() function in my code above. However, I think there is some elegance in having a unique value count as 1 and then using a threshold of 2 for duplicate detection: the sum of the count array is then equal to its number of elements, a useful check on the algorithm and implementation.
-
Kittu20 has discovered a set of problems typically described using terms like "number of occurrences", "distribution" or "frequency", or "histogram".
There are many applicable algorithms that differ in their behaviour and requirements. For example, if we already have all the data in an array, we can use an offline algorithm; but if we get the data element by element and need the results on an ongoing basis, we use an online algorithm (https://en.wikipedia.org/wiki/Online_algorithm). (Don't let the terms 'online' and 'offline' confuse you: these are not related to 'online' and 'offline' as used for, say, network connectivity.)
Online algorithms for this generally use a data structure that stores not only the values to be counted, but the number of occurrences of that specific value.
Offline algorithms vary much more, especially depending on how many unique values you're interested in compared to the number of values you have.
Hash tables can be used when the order of the values is irrelevant; optimum hash table size is about twice the length of the input. Sorted data structures can be used if the data needs to be sorted anyway. And so on.
The methods described in previous messages in this thread show the array or "histogram" approach. Each array element corresponds to one value (or a range of values for a proper histogram, technically), describing the number of occurrences of that value. In the examples, the values shown are between 0 and 9, inclusive, so an array of ten possible values are used. Often, we do an initial loop of all known values to find the minimum and maximum, and allocate an array large enough to hold everything in between. (Note that in C on current 64-bit architectures, you can dynamically allocate an array large enough to describe a 32-bit count for each possible 32-bit integer (or floating-point) value. Such an array is 16 GiB in size, though. If you try to declare such an array as a static (global) or local variable, it will usually fail; you do need to use malloc(), calloc(), or realloc() to dynamically allocate the memory for such an array.)
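As a sketch of that min/max-then-allocate approach (the example values are made up, and it assumes max - min fits in an int):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int array[] = {100, 102, 100, 102, 105, 102};
    size_t n = sizeof(array) / sizeof(array[0]);

    /* Initial pass: find the range of values actually present. */
    int min = array[0], max = array[0];
    for (size_t i = 1; i < n; i++) {
        if (array[i] < min) min = array[i];
        if (array[i] > max) max = array[i];
    }

    /* Dynamically allocate one counter per possible value in [min, max]. */
    size_t range = (size_t)(max - min) + 1;
    size_t *count = calloc(range, sizeof(*count));
    if (!count)
        return 1;

    for (size_t i = 0; i < n; i++)
        count[array[i] - min]++;       /* shift values down to index 0 */

    for (size_t v = 0; v < range; v++)
        if (count[v] > 1)
            printf("%d found %zu times\n", min + (int)v, count[v] - 1);

    free(count);
    return 0;
}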
As always, there are tradeoffs.
In general, for very short arrays, say up to a dozen or two entries, the two nested loop approach will likely be just as fast as the array option, because the number of loop iterations is so small: for N=16, N²=256, and 256 iterations of a simple loop body will be ridiculously fast anyway.
The exact cutoff, be it 5 or 16 or whatever, varies depending on the machine and even on the array item type.
For arrays where the range of values is shorter than say twice the number of elements in the array, the array or histogram approach tends to be the fastest.
For arrays where the range of values is larger than say twice the number of elements in the array, a hash table preallocated to about twice the number of elements in the array (to keep the access time short, by ensuring the fill level is at most 50%), tends to be the fastest.
When one wants to use as little memory as possible, the two nested loop approach is perfectly acceptable; it will just become proportionally slower as the array grows.
I could show examples of a generic histogram/occurrences data structure and the three approaches for using them, and a single function call interface that selects which one is best used for the specified data (via the above heuristic, with the exact limits set by compile-time constants), but I suspect it would be too much too soon for Kittu20, and probably not that interesting to others. So, instead, I'll just leave this as a wall-of-text post. Apologies for that.
-
I disagree with this assertion. (…)
Emphasis added: as gf has noted, the array should be sorted and the lengths of runs of identical values should be counted. I am repeating this for a reason. It may not be obvious, but the solutions with an array of counters are also using that approach. That’s counting sort (https://en.wikipedia.org/wiki/Counting_sort) with the final step optimized out.(1) The importance of this information? The algorithm retains all features of a sorting algorithm, including time and space complexities and access patterns.
-
I disagree with this assertion. (…)
Emphasis added: as gf has noted, the array should be sorted and the lengths of runs of identical values should be counted. (…) The algorithm retains all features of a sorting algorithm, including time and space complexities and access patterns.
Precisely. You cannot use a counting sort as a general sort algorithm, only as a special case sort algorithm. So if your list is not suitable for a counting sort, you will need to prefer a different algorithm.
So the algorithm does not: "retain all features of a sorting algorithm, including time and space complexities and access patterns"
If your list does not meet the criteria for a counting sort to be efficient, then sorting first will be much slower than simply counting the duplicates.
You can refer to Nominal Animal's post above for a more detailed explanation.
-
Precisely. You cannot use a counting sort as a general sort algorithm, only as a special case sort algorithm.
Yes, exactly: as gf has noted, the array should be sorted and the lengths of runs of identical values should be counted. I am repeating this for a reason. It may not be obvious, but the solutions with an array of counters are also using that approach. That’s counting sort (https://en.wikipedia.org/wiki/Counting_sort) with the final step optimized out.(1) The importance of this information? The algorithm retains all features of a sorting algorithm, including time and space complexities and access patterns.
(In case someone has seen the rest of this response: I removed it as pointless. The original message talked only about a specific case, so there is no need for me to dig into other situations. And while the statement seems extendable to other cases, explaining that would only add noise for the OP; it is not trivial and requires shifting a pre-learned conceptual view, a task almost impossible in a confrontational environment.)
-
If your list does not meet the criteria for a counting sort to be efficient, then sorting first will be much slower than simply counting the duplicates.
The only “simply count the duplicates when a counting sort is not efficient” algorithm I can readily come up with is O(N^2) in time, while O(N log N) is typical for an efficient general sort. Am I missing an efficient “simply count duplicates” algorithm?
-
You guys love overcomplicating your homework solutions >:D
Here's 16 lines of code, linear time, no additional memory 8)
#include <limits.h>
#include <stdio.h>

int main() {
    int array[5] = {1, 2, 1, 2, 5};
    for (int i = INT_MIN; ; i++) {
        size_t count = 0;
        for (size_t j = 0; j < sizeof(array)/sizeof(array[0]); j++)
            if (array[j] == i)
                count++;
        if (count > 1)
            printf("%d repeated %zu times\n", i, count - 1);
        if (i == INT_MAX)
            break;
    }
}
:popcorn:
edit
Sorry, my mistake. It's fixed now ::)
-
The only “simply count the duplicates when a counting sort is not efficient” algorithm I can readily come up with is O(N^2) in time, while O(N log N) is typical for an efficient general sort. Am I missing an efficient “simply count duplicates” algorithm?
"Simply counting" becomes O(N) -- i.e. single pass through the array -- if you manage to locate the counter variable for each number with O(1) complexity.
-
You guys love overcomplicating your homework solutions >:D
Here's 14 lines of code, linear time, no additional memory 8)
#include <limits.h>
#include <stdio.h>

int main() {
    int array[5] = {1, 2, 1, 2, 2};
    for (int i = INT_MIN; i <= INT_MAX; i++) {
        unsigned long count = 0;
        for (size_t j = 0; j < sizeof(array)/sizeof(array[0]); j++)
            if (array[j] == i)
                count++;
        if (count > 1)
            printf("%d found %lu times\n", i, count - 1);
    }
}
}
:popcorn:
Although likely slower in practice than most other approaches, the complexity is even O(N) >:D
-
You quoted the buggy version, which runs in an infinite loop; there is a subtle difference now :P
(Actually, it triggered undefined behavior when it started to repeat itself, but it worked for me.)
My program takes less than a minute to complete, and it will get twice as fast every two years :-DD
-
I have read the original post many times and can't see what defines a duplicate. (…) In other words, what are the rules for a duplicate?
Quite right, the OP's question was really ill-defined. Many of us have just taken it as counting *all* duplicates in some array, but the way it was formulated, it's absolutely as clear as mud.
If you can't ask the right question, finding the answer is a random process.
-
If your list does not meet the criteria for a counting sort to be efficient, then sorting first will be much slower than simply counting the duplicates.
The only “simply count the duplicates when a counting sort is not efficient” algorithm I can readily come up with is O(N^2) in time, while O(N log N) is typical for an efficient general sort. Am I missing an efficient “simply count duplicates” algorithm?
You can use a hash table to store the counts. Since hash table accesses are nominally O(1), that gives an average time of O(N), but the worst-case behavior is O(N^2) for a basic hash table. However, it's usually not worth agonizing over asymptotic log(N) factors: for practical systems log(N) is always limited to a pretty small number, so the constant factor is often as important as the difference between O(N) and O(N log N). When you are worrying at that level, it often makes sense to count operations exactly rather than take an asymptotic limit.
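A minimal sketch of that approach (open addressing with linear probing; the table size, the multiplicative hash constant, and the example data are my illustrative choices, and the table is sized so it cannot fill up):

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 16   /* power of two, roughly 2x the input size */

struct slot { int value; size_t count; };   /* count == 0 means empty */

int main(void)
{
    int array[] = {1, 2, 1, 2, 5, 2};
    size_t n = sizeof(array) / sizeof(array[0]);
    struct slot table[TABLE_SIZE];
    memset(table, 0, sizeof(table));

    for (size_t i = 0; i < n; i++) {
        /* Toy multiplicative hash; power-of-two size lets us mask
           instead of using the costlier modulo. */
        size_t h = ((size_t)array[i] * 2654435761u) & (TABLE_SIZE - 1);
        /* Linear probing: skip occupied slots holding other values. */
        while (table[h].count != 0 && table[h].value != array[i])
            h = (h + 1) & (TABLE_SIZE - 1);
        table[h].value = array[i];
        table[h].count++;
    }

    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].count > 1)
            printf("%d found %zu times\n", table[i].value, table[i].count - 1);
    return 0;
}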
-
Hash table is a great answer that I missed. Thanks!
-
Good, just ... hash tables usually have some probability of collisions.
They need to be carefully set up to avoid/minimize them.
The hash can use ROTR and ROTL :D
-
Good, just ... hash tables usually have some probability of collisions.
Like I said in #16 (https://www.eevblog.com/forum/programming/counting-duplicates-numbers-in-array/msg4642456/#msg4642456), the optimum size is about twice the input array size. This assumes you use the common scheme of probing [H%N] and, if it is occupied but non-matching, [(H+D)%N], [(H+2*D)%N], [(H+3*D)%N], and so on, until either an unused slot or the matching value is found (H being the hash of the value, N the hash table size, and D the probe step size, often 1).
Obviously, to avoid the costly modulo operation per probe, one should make N a power of two, about twice as large as the array size.
Really, a generic function to count the number of occurrences in an array, needs all three: nested-loop O(N²), range histogram, and a hash table.
If the array is short, the nested loop one makes most sense, because the iterations are fast, and the total number of iterations for small N, say up to a dozen perhaps, is faster to do than the alternatives. Otherwise, do a pass over the array entries, and find out the continuous range of values. If it is smaller than say 4× the number of entries in the array, you allocate an array of counters large enough for each possible value, and do the range histogram. Otherwise, you allocate a hash table of about 2× the number of entries in the array (noting that each hash table entry contains both the original value, and the count), and do the hash table approach.
Is it worth it? I dunno. I don't think so. But knowing the three approaches is useful, because one's use cases tend to fall into one of the three.
It's like with radix-sorting IEEE 754 double-precision numbers. It is quite straightforward: you just need to XOR-mask the high bit if unset, and all bits if set, so that when interpreted as an unsigned 64-bit integer, the values sort exactly like their original finite double-precision values would. Redo afterwards to return the original values. The optimal pass sizes do depend on the cache architecture, and although it does scale as O(N), you need bloody huge arrays, tens of millions to billions of doubles, before you beat the traditional O(N log N) sort algorithms. In many cases, by doing the XOR-mask pass before and after, and treating the doubles as 64-bit integers, you can speed up the sort enough that the amount of data at which the difference would matter to a human, is too large to worry about: the code maintenance cost is more important in practice.
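For the curious, a sketch of just that XOR-mask mapping (my illustration; it assumes IEEE 754 doubles, and uses a memcpy type pun, which is well-defined in C):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Map a double's bit pattern to a uint64_t whose unsigned ordering
   matches the ordering of the original finite doubles:
   sign bit clear -> flip only the sign bit;
   sign bit set   -> flip all bits. */
static uint64_t double_to_sortable(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof(u));
    return u ^ ((u >> 63) ? UINT64_MAX : UINT64_C(1) << 63);
}

int main(void)
{
    double a = -1.5, b = -0.25, c = 0.0, d = 2.0;
    /* Each comparison should print 1, matching a < b < c < d. */
    printf("%d %d %d\n",
           double_to_sortable(a) < double_to_sortable(b),
           double_to_sortable(b) < double_to_sortable(c),
           double_to_sortable(c) < double_to_sortable(d));
    return 0;
}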
-
Good, just ... hash tables usually have some probability of collisions.
They need to be carefully set up to avoid/minimize them.
Sure. Keeping hash tables efficient under different working conditions is one of the most well studied problems in computing. It's certainly possible to screw up, but usually possible to do correctly.